Appendix B — The Open-Source / Self-Hosted Reference Architecture
This is the stack I'd hand an IT director who says "we can't put our data in someone else's cloud" — or who just wants to own the thing rather than rent it. Every tool here is an actively maintained open-source project as of writing (June 2026). Tools move fast: version numbers, and sometimes whole projects, change. Check the companion site for the current cut before you commit. The architecture — the layers and what each one does — is the durable part. The specific tool in each box is the swappable part.
You don't need all of this. Most mid-market firms need four or five boxes, not nineteen. The diagram shows the full shape so you can mark what you already have and what's missing. Start with the model gateway and the observability layer — those two earn their keep first.
B.1 — The stack, by layer
Layer 1 — Model gateway / routing
Sits in front of every model so you can switch providers, cap spend, and route the easy work to a cheap model.
- LiteLLM — one API in front of 100+ model providers; the spend cap, routing, and fallback layer. (github.com/BerriAI/litellm)
- OpenRouter — hosted multi-provider routing if you'd rather not self-host the gateway. (openrouter.ai)
Layer 2 — Inference serving (self-hosted models)
Where your open-weight models actually run when you're keeping data in-house.
- vLLM — high-throughput production inference serving; the workhorse for self-hosted models at volume. (vllm.ai)
- Ollama — the easy on-ramp; run an open-weight model on a laptop or a single box to prototype. (ollama.ai)
Open-weight model landscape as of writing: Llama 4 (Meta), Qwen 3.5 (Alibaba), DeepSeek V4 (DeepSeek), Gemma 4 (Google), Mistral Large. Which one leads changes month to month — check a live leaderboard (e.g., LLM-Stats.com) rather than trusting a number printed in a book.
Layer 3 — Orchestration (the agent framework)
Defines how agents take steps, hand off, and stay in their lane.
- LangGraph — graph-based orchestration for multi-agent and stateful workflows; the most common production choice. (langchain.com/langgraph)
- CrewAI — role-based multi-agent framework, lighter to start with. (crewai.com)
Layer 4 — Memory / vector storage
The knowledge base that lives outside the model. The context window is a scratchpad, not memory.
- pgvector — vector storage inside the PostgreSQL you probably already run; the "you don't need a new database" option. (github.com/pgvector/pgvector)
- Qdrant — purpose-built vector database for when scale or filtering outgrows pgvector. (qdrant.tech)
Layer 5 — RAG / retrieval quality
Getting the right facts into the context window. This is where most RAG systems actually fail — on retrieval, not generation.
- Built on Layers 3 + 4 (the orchestration framework drives retrieval; the vector store holds the embeddings). The discipline lives in chunking, embedding choice, and hybrid retrieval — not in a single tool. Evaluate retrieval quality with the evals layer below.
Layer 6 — Observability
If you can't see what the agent did, you can't trust it, debug it, or cost it. Stand this up first, before you scale anything.
- Langfuse — open-source LLM observability: traces, costs, latency, eval hooks. (langfuse.com)
- Arize Phoenix — open-source alternative for tracing and evaluation. (github.com/Arize-ai/phoenix)
Layer 7 — Evals
The test suite for an AI feature. Error analysis on real traces, then automated checks in CI. The highest-impact discipline most firms skip.
- Promptfoo — eval and red-teaming for prompts and agents; runs in CI. (promptfoo.dev)
- Ragas — RAG-specific evaluation (faithfulness, answer relevancy, context precision). (docs.ragas.io)
- DeepEval — broader LLM evaluation framework with a pytest-style feel. (github.com/confident-ai/deepeval)
- DSPy — programmatic prompt optimization when you want the system to tune prompts rather than hand-tuning them. (dspy.ai)
Layer 8 — Durable execution
Keeps a long-running agent workflow alive through a crash, a timeout, or a restart on a Tuesday night. The difference between a demo and a system.
- Temporal — the established durable-execution engine; OpenAI runs agentic workloads on it. (temporal.io)
- Inngest — serverless-first durable execution; lighter operational footprint. (inngest.com)
- DBOS — PostgreSQL-native durable execution; fewest moving parts if you're already on Postgres. (dbos.dev)
Layer 9 — The integration standards (the plumbing)
Not products you install so much as standards your tools should speak.
- MCP (Model Context Protocol) — the default way agents connect to tools and data; ask every vendor if they speak it. (modelcontextprotocol.io)
- A2A (Agent-to-Agent) — interoperability standard for agents talking to each other, now under the Linux Foundation. (a2a-protocol.org)
Layer 0 — Data engineering (the prerequisite nobody photographs)
None of the above matters if the data underneath is a mess. This layer comes first in reality even though it's last on the diagram.
- Your existing data stack — ingestion, transformation, the warehouse. Name the usual suspects (Fivetran / dbt / your warehouse) here. AI sits on top of clean, governed data or it sits on sand.
B.2 — The stack diagram (printable)
YOUR APPLICATION / WORKFLOW
|
============================================================================
| LAYER 6 — OBSERVABILITY (wraps everything) |
| Langfuse / Arize Phoenix <- traces, cost, latency, eval hooks |
============================================================================
|
+------------------------------------------------------------------------+
| LAYER 3 — ORCHESTRATION |
| LangGraph / CrewAI <- agents, steps, handoffs, HITL gates |
+------------------------------------------------------------------------+
| | | |
v v v v
+-----------+ +------------------+ +---------------------+ +-----------+
| LAYER 1 | | LAYER 4 | | LAYER 7 | | LAYER 8 |
| GATEWAY | | MEMORY / VECTOR | | EVALS | | DURABLE |
| LiteLLM | | pgvector | | Promptfoo / Ragas | | EXECUTION |
| OpenRouter| | Qdrant | | DeepEval / DSPy | | Temporal |
+-----------+ +------------------+ +---------------------+ | Inngest |
| (LAYER 5: RAG quality | DBOS |
v rides on Layers 3+4) +-----------+
+-----------+
| LAYER 2 |
| INFERENCE |
| vLLM | <- self-hosted open-weight models
| Ollama | (Llama 4 / Qwen 3.5 / DeepSeek V4 / Gemma 4 / Mistral)
+-----------+
----- LAYER 9: STANDARDS spoken across the stack -----
MCP (tools/data) . A2A (agent-to-agent)
============================================================================
| LAYER 0 — DATA ENGINEERING (the foundation everything sits on) |
| ingestion . transformation . warehouse . governance |
============================================================================
B.3 — How to read this if you're the one who has to build it
Mark every box you already have. For most mid-market firms, Layer 0 (data) and Layer 6 (observability) are the two that are missing or half-built — and they're the two that decide whether the rest works.
Don't start by buying an orchestration framework. Start by putting LiteLLM in front of whatever model you're already calling (so you can cap spend and switch providers without rewriting code), and Langfuse around it (so you can see what's happening). That's two boxes, a weekend of work, and it pays for itself the first time a runaway loop would have run up a bill you couldn't see.
The standards in Layer 9 aren't optional homework. Put the three questions in your vendor RFP: Is it MCP-compatible? A2A-aware? Does it survive a crash? The answers sort the systems from the demos faster than any feature list.
Want a second set of eyes on this in your firm? The no-sell promise applies: if it isn't a fit, I'll tell you in the first ten minutes.
Book a 30-Minute Call →