Building AI into your product in 2026 means navigating a complex landscape of providers, frameworks, and deployment options. Do you call OpenAI directly? Use a framework like LangChain? Self-host an open model? The right answer depends on your constraints around cost, latency, privacy, and operational complexity.

This guide walks through the major architectural patterns, compares their tradeoffs, and concludes with a complete design for a real-world Document Q&A system.

The Landscape at a Glance

| Pattern | Best for |
| --- | --- |
| Direct provider APIs | Single-provider teams, cutting-edge features, prototypes and MVPs |
| Unified API gateways | Vendor flexibility, fallback reliability, comparing multiple models |
| LLM frameworks | Complex pipelines, RAG, agents, proven patterns over custom code |
| Self-hosted models | High volume, privacy-sensitive or air-gapped data, full control |
| RAG | Grounding answers in your own data; combines with any of the above |

Let's examine each pattern in detail.

Pattern 1: Direct Provider APIs

The simplest approach: call OpenAI, Anthropic, or Google directly. You send HTTP requests to their APIs and receive model outputs.

Implementation Example
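
A minimal sketch using the official Anthropic Python SDK (the OpenAI and Google SDKs follow the same shape: build a client, make one call, read a typed response). The model name and prompt are placeholders.

```python
# pip install anthropic
import anthropic

# Reads ANTHROPIC_API_KEY from the environment by default
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whatever tier you've chosen
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize our refund policy in two sentences."}
    ],
)

print(response.content[0].text)
```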

When to Use Direct APIs

| Pros | Cons |
| --- | --- |
| Simplest implementation | Vendor lock-in |
| First access to new features | Separate integrations per provider |
| Best latency (no intermediary) | No fallback if provider is down |
| Official SDKs well-maintained | Cost optimization is manual |

Best for: Teams committed to a single provider, applications needing cutting-edge features, prototypes and MVPs.

Pattern 2: Unified API Gateways

Services like OpenRouter, Amazon Bedrock, and Azure OpenAI provide a single API that routes to multiple model providers.
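
To make the "single API, many models" idea concrete, here is a minimal sketch against OpenRouter, which (like several gateways) speaks the OpenAI wire format, so switching providers is a one-string change. The model slugs are illustrative.

```python
# pip install openai
from openai import OpenAI

# OpenRouter is OpenAI-compatible; only the base URL and key differ.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

def ask(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# Switching providers is just a different model string:
print(ask("anthropic/claude-sonnet-4", "Name three uses of pgvector."))
print(ask("openai/gpt-4o", "Name three uses of pgvector."))
```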

Key Benefits

  • Model switching: Change models by updating a single parameter, no code changes required
  • Automatic fallbacks: If Claude is down, route to GPT-4 automatically
  • Unified billing: One invoice regardless of which providers you use
  • A/B testing: Compare model performance in production

Tradeoffs

| Consideration | Impact |
| --- | --- |
| Latency | ~10-15% higher (additional hop) |
| Data privacy | Requests pass through intermediary |
| Feature parity | New provider features may lag |
| Cost | Small markup over direct pricing |

Best for: Teams wanting vendor flexibility, production systems needing fallback reliability, organizations comparing multiple models.

Pattern 3: LLM Frameworks

Frameworks like LangChain, LlamaIndex, Haystack, and DSPy provide abstractions for common LLM patterns: chains, agents, retrieval, memory.

Framework Comparison

| Framework | Strength | Best For |
| --- | --- | --- |
| LangChain | Ecosystem (100+ integrations) | Complex, tool-augmented workflows |
| LlamaIndex | Data ingestion and indexing | RAG over large document sets |
| Haystack | Production-ready pipelines | Document QA in production |
| DSPy | Minimal overhead, optimized prompts | Performance-critical applications |

Performance Benchmarks

Recent benchmarks running identical agentic RAG workflows across frameworks reveal significant differences in overhead: the thinner the abstraction layer, the less latency it adds on top of the underlying model calls, which is why DSPy earns the performance-critical slot in the table above.

LangChain Example
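
A minimal sketch of the LangChain approach, composing a prompt, model, and output parser with the expression language (LCEL). The model name and package split reflect the current langchain-openai conventions, so adjust to your setup.

```python
# pip install langchain-openai langchain-core
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You answer questions about internal documentation, citing sources."),
    ("human", "{question}"),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LCEL: the | operator composes prompt, model, and parser into one runnable chain
chain = prompt | llm | StrOutputParser()

answer = chain.invoke({"question": "What file formats does our ingestion pipeline accept?"})
print(answer)
```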

Framework Tradeoffs

| Pros | Cons |
| --- | --- |
| Rapid prototyping | Abstraction overhead |
| Pre-built integrations | Harder to debug (magic) |
| Standard patterns (RAG, agents) | Version churn (breaking changes) |
| Active communities | Learning curve for framework idioms |

Best for: Teams building complex LLM pipelines, RAG applications, agent-based systems, or anyone wanting proven patterns over custom code.

Pattern 4: Self-Hosted Models

Run open-weight models (Llama, Mistral, Qwen) on your own infrastructure. This gives you complete control over data privacy and can reduce costs at scale.

Inference Server Comparison

| Tool | Strength | Use Case |
| --- | --- | --- |
| vLLM | Highest throughput (PagedAttention) | Production multi-user serving |
| Ollama | Simplest setup | Development, single-user |
| llama.cpp | CPU inference, low VRAM | Edge devices, limited GPU |
| TGI | Hugging Face ecosystem | HF model serving |
| LocalAI | OpenAI-compatible, Docker-native | Drop-in OpenAI replacement |

Performance: vLLM vs Ollama vs llama.cpp

In short: vLLM wins on multi-user throughput thanks to PagedAttention and continuous batching, Ollama trades raw throughput for the easiest setup, and llama.cpp is the practical choice when you're limited to CPU or minimal VRAM.

Running Ollama Locally
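
A sketch of the local-development loop, assuming you've already pulled a model with the Ollama CLI. Ollama exposes an OpenAI-compatible endpoint on port 11434, so the same client code from Patterns 1 and 2 works unchanged; the model tag below is illustrative.

```python
# One-time setup (shell):
#   ollama pull llama3.2
#   ollama serve   # usually started automatically by the desktop app / service
#
# pip install openai
from openai import OpenAI

# Ollama serves an OpenAI-compatible API locally; the api_key is ignored but required by the client.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
)
print(resp.choices[0].message.content)
```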

vLLM for Production
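
For production serving, a sketch of vLLM's offline Python API; the model name and single-GPU assumption are illustrative (a 70B model needs multiple GPUs or quantization). In practice you'd more often run vLLM's OpenAI-compatible server and point existing client code at it, as noted in the trailing comment.

```python
# pip install vllm   (requires a CUDA-capable GPU)
from vllm import LLM, SamplingParams

# Smaller model used here so the sketch fits on a single GPU; swap in your production model.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of continuous batching."], params)

for out in outputs:
    print(out.outputs[0].text)

# For serving, run the OpenAI-compatible server instead:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# and point the same OpenAI client at http://<host>:8000/v1.
```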

Self-Hosted Tradeoffs

| Pros | Cons |
| --- | --- |
| Complete data privacy | GPU infrastructure costs (~$2-3K/mo cloud) |
| No per-token costs | Model quality gap vs GPT-4/Claude |
| Full customization (fine-tuning) | Operational complexity |
| Works offline | Keeping up with model releases |

Best for: High-volume applications where API costs would be prohibitive, privacy-sensitive data, offline/air-gapped environments, teams wanting full control.

Pattern 5: RAG (Retrieval-Augmented Generation)

RAG is less a deployment pattern and more an architectural pattern that can be combined with any of the above. It augments LLM responses with retrieved context from your own data.

Vector Database Options

| Database | Type | Best For |
| --- | --- | --- |
| Pinecone | Managed | Zero-ops, serverless scale |
| Qdrant | Open source (Rust) | Complex filtering, self-hosted |
| Weaviate | Open source (Go) | Multi-modal, GraphQL API |
| pgvector | PostgreSQL extension | Teams already on PostgreSQL |
| Chroma | Open source (Python) | Rapid prototyping |
| Milvus | Open source | Billion-scale vectors |

pgvector vs Dedicated Vector DBs

Recent benchmarks show pgvectorscale (Timescale's enhanced pgvector) achieving 471 QPS at 99% recall on 50M vectors—11x better than Qdrant and competitive with Pinecone, at ~75% lower cost when self-hosted.

Choose pgvector when: You're already on PostgreSQL, want unified data management, or need transactional consistency between metadata and vectors.

Choose dedicated vector DB when: You need specialized features (billion-scale, complex filtering), want managed infrastructure, or have extreme performance requirements.
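
If you go the pgvector route, the moving parts are small. A minimal sketch, assuming PostgreSQL with the extension available, the psycopg driver, and the pgvector Python adapter; the table name, dimensions, and connection string are made up for illustration.

```python
# pip install "psycopg[binary]" pgvector numpy
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://localhost/acme_docs", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg send/receive numpy arrays as pgvector values

conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        content text NOT NULL,
        embedding vector(1536)      -- dimension of text-embedding-3-small
    )
""")

# Nearest neighbours by cosine distance (the <=> operator). The query
# embedding would come from the same model used at indexing time.
query_embedding = np.zeros(1536)    # placeholder vector for illustration
rows = conn.execute(
    "SELECT id, content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (query_embedding,),
).fetchall()
print(rows)
```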

Decision Framework

The short version: start from your hardest constraint and work backwards. If confidential data cannot leave your infrastructure, self-host. If volume is high enough that per-token API costs would be prohibitive, self-hosting pays off. If you need provider flexibility or automatic fallbacks, put a gateway in front. If you're prototyping or committed to one provider, call it directly. And if the real problem is grounding answers in your own documents, layer RAG on top of whichever option you choose.

Canonical Architecture: Document Q&A Repository

Let's design a production system for a fictional company, Acme Corp, that needs to let employees ask questions about their internal documentation: policy manuals, technical docs, HR guides, and product specs.

Requirements

  • ~50,000 documents (PDFs, Word docs, Markdown, HTML)
  • ~500 daily active users, ~2,000 queries/day
  • Must cite sources in answers
  • Confidential data (cannot use third-party APIs without approval)
  • Needs to be accurate—wrong answers are worse than no answer
  • Budget: moderate (not unlimited, not shoestring)

Architecture Overview

The system has two paths. An ingestion pipeline parses documents (Unstructured.io), chunks them semantically, embeds each chunk (text-embedding-3-small), and stores text plus vectors in PostgreSQL. A query pipeline embeds the user's question, runs hybrid retrieval (vector similarity plus keyword search), reranks the ~20 candidates down to the best 5 with a cross-encoder, and sends those chunks to the LLM with instructions to answer only from the provided context and cite sources.

Technology Choices

| Component | Choice | Rationale |
| --- | --- | --- |
| Document Parsing | Unstructured.io | Handles PDFs, Word, HTML; extracts tables |
| Chunking | Semantic chunking | Respects document structure |
| Embeddings | text-embedding-3-small | Good quality, low cost ($0.02/1M tokens) |
| Vector Store | PostgreSQL + pgvector | Unified data layer, transactional consistency |
| Reranker | bge-reranker-v2-m3 | Open source, runs locally, high quality |
| LLM (Option A) | Claude Sonnet via API | Best quality, approved for internal data |
| LLM (Option B) | Llama 3.3 70B via vLLM | Self-hosted, no data leaves premises |
| Framework | LlamaIndex | Best for document indexing and retrieval |
| API Layer | FastAPI | Async, streaming support, OpenAPI docs |

Key Design Decisions

1. Hybrid Search

Vector search alone misses keyword matches. We combine:

  • Vector similarity (semantic meaning)
  • BM25 / full-text search (exact keywords)

PostgreSQL's pg_trgm extension provides fast trigram search alongside pgvector.
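
One way to wire this up, sketched against the hypothetical chunks table from the pgvector example above: run the vector query and the trigram query separately, then merge the two ranked lists with reciprocal rank fusion so neither signal needs score calibration. The pg_trgm extension is assumed to be installed.

```python
# Hybrid retrieval sketch: vector candidates + trigram candidates, merged with
# reciprocal rank fusion (RRF). Assumes the `chunks` table, psycopg connection,
# and register_vector() setup from the earlier pgvector sketch, plus:
#   CREATE EXTENSION IF NOT EXISTS pg_trgm;
import numpy as np

def vector_candidates(conn, query_embedding: np.ndarray, k: int = 20) -> list[tuple[int, str]]:
    return conn.execute(
        "SELECT id, content FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (query_embedding, k),
    ).fetchall()

def keyword_candidates(conn, query_text: str, k: int = 20) -> list[tuple[int, str]]:
    # pg_trgm's similarity() scores trigram overlap between query and chunk text
    return conn.execute(
        "SELECT id, content FROM chunks ORDER BY similarity(content, %s) DESC LIMIT %s",
        (query_text, k),
    ).fetchall()

def reciprocal_rank_fusion(result_lists: list[list[tuple[int, str]]], k: int = 60) -> list[tuple[int, str]]:
    scores: dict[int, float] = {}
    texts: dict[int, str] = {}
    for results in result_lists:
        for rank, (chunk_id, content) in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
            texts[chunk_id] = content
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(chunk_id, texts[chunk_id]) for chunk_id in ranked]

# merged = reciprocal_rank_fusion([
#     vector_candidates(conn, query_embedding),
#     keyword_candidates(conn, "parental leave policy"),
# ])
```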

2. Reranking

Initial retrieval returns ~20 candidates. A cross-encoder reranker (which is more accurate but slower than bi-encoders) re-scores these to pick the best 5. This dramatically improves answer quality.
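
A sketch of that reranking step, assuming sentence-transformers' CrossEncoder wrapper can load the bge-reranker-v2-m3 weights (the model's native interface is the FlagEmbedding library); the candidate passages are placeholders.

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "How many weeks of parental leave do we offer?"
candidates = [
    "Employees are entitled to 16 weeks of paid parental leave...",
    "The travel policy covers economy-class flights...",
    # ...the ~20 chunks returned by hybrid retrieval
]

# The cross-encoder scores each (query, passage) pair jointly, which is what
# makes it more accurate (and slower) than the bi-encoder used for retrieval.
scores = reranker.predict([(query, passage) for passage in candidates])
top5 = [passage for _, passage in sorted(zip(scores, candidates), reverse=True)[:5]]
print(top5)
```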

3. Citation Enforcement

The prompt explicitly requires citing sources by number. We validate that all citation numbers correspond to retrieved documents.
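
A sketch of that check, assuming the prompt asks for bracketed numeric citations like [2]: extract the numbers from the answer and flag any that don't map to a retrieved source.

```python
import re

def invalid_citations(answer: str, num_sources: int) -> set[int]:
    """Return citation numbers that don't correspond to a retrieved document."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return {n for n in cited if n < 1 or n > num_sources}

answer = "Employees get 16 weeks of parental leave [2] and may extend unpaid [7]."
print(invalid_citations(answer, num_sources=5))   # {7} -> reject or regenerate
```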

4. Confidence / Abstention

The prompt instructs the model to say "I don't know" rather than hallucinate. We also monitor retrieval scores—if all retrieved documents have low similarity, we proactively warn the user.
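
A sketch of the score check, with a hypothetical cutoff (pgvector cosine distances, so lower means more similar); the threshold would be tuned against a labelled query set.

```python
LOW_CONFIDENCE_DISTANCE = 0.45   # hypothetical cutoff; tune on your own evaluation queries

def retrieval_warning(distances: list[float]) -> str | None:
    """Warn when even the best-matching chunk is far from the query."""
    if not distances or min(distances) > LOW_CONFIDENCE_DISTANCE:
        return ("I couldn't find documentation that clearly answers this; "
                "the answer below may be incomplete.")
    return None
```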

Cost Estimate (2,000 queries/day)
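
The exact figures depend on token counts you'd measure in production, but the arithmetic is simple. In the sketch below, every input is a labelled assumption except the embedding price quoted in the table above; Option B swaps the per-token terms for the GPU line item from Pattern 4.

```python
# All inputs are illustrative assumptions, not Acme's actual bill.
QUERIES_PER_DAY = 2_000
PROMPT_TOKENS = 3_000        # ~5 retrieved chunks + question + instructions (assumed)
OUTPUT_TOKENS = 400          # assumed average answer length
EMBED_TOKENS = 30            # embedding the query itself

PRICE_IN = 3.00 / 1_000_000      # $/input token  (illustrative API pricing)
PRICE_OUT = 15.00 / 1_000_000    # $/output token (illustrative API pricing)
PRICE_EMBED = 0.02 / 1_000_000   # text-embedding-3-small, from the table above

daily = QUERIES_PER_DAY * (
    PROMPT_TOKENS * PRICE_IN + OUTPUT_TOKENS * PRICE_OUT + EMBED_TOKENS * PRICE_EMBED
)
print(f"~${daily:,.0f}/day, ~${daily * 30:,.0f}/month in model costs")
# Under these assumptions: roughly $30/day, ~$900/month for Option A.
# Option B replaces the per-token cost with GPU infrastructure (~$2-3K/mo, per Pattern 4).
```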

Implementation Skeleton
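
A skeleton of the query path as a FastAPI service. The retrieval, reranking, and generation helpers are stubbed so the file runs on its own; their real versions are the pieces sketched under Key Design Decisions.

```python
# pip install fastapi uvicorn
from dataclasses import dataclass
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Acme Document Q&A")

@dataclass
class Chunk:
    source: str
    content: str
    distance: float

class QueryRequest(BaseModel):
    question: str

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    warning: str | None = None

# Stubs standing in for the hybrid-retrieval, reranking, and prompting code above.
def hybrid_retrieve(question: str, k: int = 20) -> list[Chunk]:
    return [Chunk(source="hr/parental-leave.md", content="...", distance=0.31)]

def rerank(question: str, chunks: list[Chunk], top_n: int = 5) -> list[Chunk]:
    return chunks[:top_n]

def generate_answer(question: str, chunks: list[Chunk]) -> str:
    return "Employees are entitled to 16 weeks of paid parental leave [1]."

@app.post("/query", response_model=QueryResponse)
def query(req: QueryRequest) -> QueryResponse:
    candidates = hybrid_retrieve(req.question)            # 1. hybrid search
    top_chunks = rerank(req.question, candidates)         # 2. cross-encoder rerank
    answer = generate_answer(req.question, top_chunks)    # 3. cited generation
    warning = None                                        # 4. plug in retrieval_warning() here
    return QueryResponse(
        answer=answer,
        sources=[c.source for c in top_chunks],
        warning=warning,
    )

# Run with: uvicorn app:app --reload
```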

Conclusion

There's no single "best" AI architecture—only the best architecture for your specific constraints. Here's my decision framework:

  • Just getting started? Direct API (Anthropic or OpenAI). Simplest path to production.
  • Want flexibility? OpenRouter or similar gateway. Easy model switching without code changes.
  • Building RAG or agents? LlamaIndex for data-heavy apps, LangChain for complex orchestration, DSPy for performance.
  • Privacy-critical or high-volume? Self-host with vLLM. Higher upfront cost, lower marginal cost.
  • Client-side only? WebGPU + Transformers.js. See our browser AI post.

The AI infrastructure landscape is evolving rapidly. The patterns in this guide will likely look different in a year. What won't change: the need to match your architecture to your actual constraints, not hypothetical ones.

Build the simplest thing that works. Measure. Iterate.
