
Appendix B — The Open-Source / Self-Hosted Reference Architecture

This is the stack I'd hand an IT director who says "we can't put our data in someone else's cloud," or who just wants to own the thing rather than rent it. Every tool here is an actively maintained open-source project as of writing (June 2026), with one exception: OpenRouter is a hosted routing service, so it belongs in the stack only for traffic that is allowed to leave your network. Tools move fast: version numbers, and sometimes whole projects, change. Check the companion site for the current cut before you commit. The architecture (the layers and what each one does) is the durable part. The specific tool in each box is the swappable part.

You don't need all of this. Most mid-market firms need four or five boxes, not nineteen. The diagram shows the full shape so you can mark what you already have and what's missing. Start with the model gateway and the observability layer — those two earn their keep first.


B.1 — The stack, by layer

Layer 1 — Model gateway / routing

Sits in front of every model so you can switch providers, cap spend, and route the easy work to a cheap model.
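
To make the idea concrete, here is a minimal sketch of what a gateway does, written in plain Python rather than against any real gateway's API. The model names, the prices, and the "short prompt means easy work" rule are all placeholders I made up for illustration; a real gateway such as LiteLLM has its own configuration for routing and budgets.

```python
from dataclasses import dataclass

# Hypothetical model names and per-1K-token prices, for illustration only.
PRICE_PER_1K = {"cheap-model": 0.0005, "strong-model": 0.01}


class BudgetExceeded(RuntimeError):
    """Raised when a request would push spend past the cap."""


@dataclass
class Gateway:
    budget_usd: float
    spent_usd: float = 0.0

    def pick_model(self, prompt: str) -> str:
        # Placeholder routing rule: short prompts count as "easy" work.
        return "cheap-model" if len(prompt.split()) < 50 else "strong-model"

    def charge(self, model: str, tokens: int) -> float:
        # Refuse the call before spending, not after the bill arrives.
        cost = PRICE_PER_1K[model] * tokens / 1000
        if self.spent_usd + cost > self.budget_usd:
            raise BudgetExceeded(f"{model} call would exceed ${self.budget_usd:.2f}")
        self.spent_usd += cost
        return cost
```

The point of the sketch is the shape: one place decides which model gets the work, and the same place refuses the call when the budget is gone. Because application code only ever talks to the gateway, swapping a provider becomes a one-line change.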

Layer 2 — Inference serving (self-hosted models)

Where your open-weight models actually run when you're keeping data in-house.

Open-weight model landscape as of writing: Llama 4 (Meta), Qwen 3.5 (Alibaba), DeepSeek V4 (DeepSeek), Gemma 4 (Google), Mistral Large. Which one leads changes month to month — check a live leaderboard (e.g., LLM-Stats.com) rather than trusting a number printed in a book.

Layer 3 — Orchestration (the agent framework)

Defines how agents take steps, hand off, and stay in their lane.
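
A rough sketch of what "steps, handoffs, and lanes" means in code, assuming two made-up agents (triage and billing). This is not how LangGraph or CrewAI express it; it only shows the three controls an orchestration layer gives you: a step limit, an explicit handoff, and a list of which handoffs are allowed.

```python
from typing import Callable, Optional

# Each hypothetical agent is a function: state -> name of the next agent, or None to stop.
Agent = Callable[[dict], Optional[str]]


def triage(state: dict) -> Optional[str]:
    state["category"] = "invoice" if "invoice" in state["request"].lower() else "other"
    return "billing" if state["category"] == "invoice" else None


def billing(state: dict) -> Optional[str]:
    state["answer"] = "Routed to accounts payable."
    return None


AGENTS: dict[str, Agent] = {"triage": triage, "billing": billing}
# The "lanes": which handoffs each agent is allowed to make.
ALLOWED_HANDOFFS = {"triage": {"billing"}, "billing": set()}


def run(request: str, start: str = "triage", max_steps: int = 5) -> dict:
    state = {"request": request, "path": []}
    current: Optional[str] = start
    while current is not None:
        if len(state["path"]) >= max_steps:
            raise RuntimeError("step limit reached")  # stops runaway loops
        state["path"].append(current)
        nxt = AGENTS[current](state)
        if nxt is not None and nxt not in ALLOWED_HANDOFFS[current]:
            raise PermissionError(f"{current} may not hand off to {nxt}")
        current = nxt
    return state
```

Calling run("Please check this invoice") walks triage and then billing and records that path in the state, which is exactly the trail the observability layer wants to capture.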

Layer 4 — Memory / vector storage

The knowledge base that lives outside the model. The context window is a scratchpad, not memory.
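
Underneath, a vector store does one thing: it keeps text alongside an embedding and returns the entries closest to a query embedding. The sketch below is an in-memory stand-in, assuming toy three-number embeddings I invented by hand; a real deployment would get embeddings from a model and hand the storage and search to pgvector or Qdrant.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        self.rows.append((text, embedding))

    def search(self, query: list[float], k: int = 2) -> list[str]:
        # Rank every stored row by similarity to the query, best first.
        ranked = sorted(self.rows, key=lambda r: cosine(query, r[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The store survives after the conversation ends and the context window is wiped, which is the whole distinction between memory and scratchpad.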

Layer 5 — RAG / retrieval quality

Getting the right facts into the context window. This is where most RAG systems actually fail — on retrieval, not generation.
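
One way to check retrieval separately from generation is recall@k: of the documents a person marked as relevant to a question, what fraction showed up in the top k results? The sketch below assumes you have a small hand-labeled set of questions with known relevant document IDs; the function names and IDs are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(cases: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per question."""
    return sum(recall_at_k(got, want, k) for got, want in cases) / len(cases)
```

If this number is low, no amount of prompt tuning on the generation side will fix the answers, because the right facts never reached the model.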

Layer 6 — Observability

If you can't see what the agent did, you can't trust it, debug it, or cost it. Stand this up first, before you scale anything.
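
At its smallest, observability is a wrapper that records what ran, whether it succeeded, and how long it took. Here is a sketch using a plain Python decorator and an in-memory list as the trace backend. The TRACES list and the lookup_customer tool are stand-ins I made up; Langfuse and Arize Phoenix provide their own instrumentation, and this is not their API.

```python
import functools
import time

TRACES: list[dict] = []  # stand-in for a real trace backend


def traced(step_name: str):
    """Record success, error, and latency for every call to the wrapped step."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            record = {"step": step_name, "ok": False}
            try:
                result = fn(*args, **kwargs)
                record["ok"] = True
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so no call goes unrecorded.
                record["latency_s"] = time.perf_counter() - start
                TRACES.append(record)
        return wrapper
    return decorator


@traced("lookup_customer")
def lookup_customer(customer_id: str) -> dict:
    # Hypothetical tool call.
    return {"id": customer_id, "tier": "gold"}
```

A real tracing tool adds token counts, cost, and the nesting of steps inside a request, but the habit is the same: nothing the agent does goes unrecorded.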

Layer 7 — Evals

The test suite for an AI feature. Error analysis on real traces, then automated checks in CI. The highest-impact discipline most firms skip.
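
A minimal sketch of the "automated checks in CI" half, assuming the simplest possible check: does the answer contain a phrase it must contain? The two cases and the fake_answer function are invented placeholders; in practice the cases come out of your own error analysis, and tools like Promptfoo, Ragas, or DeepEval offer richer checks than substring matching.

```python
from typing import Callable

# Hypothetical cases; real ones come from error analysis on your own traces.
CASES = [
    {"input": "What is our refund window?", "must_contain": "30 days"},
    {"input": "Who approves travel?", "must_contain": "manager"},
]


def run_evals(answer_fn: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of cases whose answer contains the required phrase."""
    passed = sum(
        1 for case in cases
        if case["must_contain"].lower() in answer_fn(case["input"]).lower()
    )
    return passed / len(cases)


def fake_answer(question: str) -> str:
    # Stand-in for the real AI feature under test.
    return "Refunds are accepted within 30 days." if "refund" in question else "Ask HR."
```

In a CI job you would compare the score against a threshold you choose and fail the build when a prompt or model change drops below it.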

Layer 8 — Durable execution

Keeps a long-running agent workflow alive through a crash, a timeout, or a restart on a Tuesday night. The difference between a demo and a system.
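
The core trick can be sketched in a few lines: save the result of each finished step, and on restart skip whatever is already saved. The version below assumes a JSON file as the checkpoint and steps whose results are JSON-serializable; the function name and file layout are mine. Temporal, Inngest, and DBOS do far more (retries, timers, distributed workers), so read this as the idea, not a replacement.

```python
import json
import os


def run_workflow(steps, checkpoint_path):
    """Run (name, fn) steps in order, skipping any already in the checkpoint."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)

    for name, fn in steps:
        if name in done:
            continue  # finished before the crash; do not redo it
        done[name] = fn(done)  # each step can read earlier results
        # Write to a temp file, then swap it in, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
    return done
```

If the second step throws on a Tuesday night, rerunning the same workflow with the same checkpoint path picks up at the second step rather than paying for the first one again.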

Layer 9 — The integration standards (the plumbing)

Not products you install so much as standards your tools should speak.
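
To show what "speaking a standard" looks like on the wire: MCP messages are JSON-RPC 2.0, and a tool invocation is a request with the method tools/call. The sketch below builds one such message. The tool name and its arguments are hypothetical, and you should check the current MCP specification for the full message shape before relying on this.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so replies can be matched


def mcp_tool_call(tool_name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 request asking an MCP server to run a tool."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)
```

Because every MCP-compatible tool accepts this same envelope, the agent framework in Layer 3 does not need a custom connector per vendor.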

Layer 0 — Data engineering (the prerequisite nobody photographs)

None of the above matters if the data underneath is a mess. This layer comes first in reality even though it's last on the diagram.


B.2 — The stack diagram (printable)

                         YOUR APPLICATION / WORKFLOW
                                    |
  ============================================================================
  | LAYER 6 — OBSERVABILITY (wraps everything)                                |
  |   Langfuse  /  Arize Phoenix      <- traces, cost, latency, eval hooks    |
  ============================================================================
                                    |
  +------------------------------------------------------------------------+
  | LAYER 3 — ORCHESTRATION                                                 |
  |   LangGraph  /  CrewAI            <- agents, steps, handoffs, HITL gates |
  +------------------------------------------------------------------------+
       |                    |                        |               |
       v                    v                        v               v
  +-----------+   +------------------+   +---------------------+  +-----------+
  | LAYER 1   |   | LAYER 4          |   | LAYER 7             |  | LAYER 8   |
  | GATEWAY   |   | MEMORY / VECTOR  |   | EVALS               |  | DURABLE   |
  | LiteLLM   |   | pgvector         |   | Promptfoo / Ragas   |  | EXECUTION |
  | OpenRouter|   | Qdrant           |   | DeepEval / DSPy     |  | Temporal  |
  +-----------+   +------------------+   +---------------------+  | Inngest   |
       |              (LAYER 5: RAG quality                       | DBOS      |
       v               rides on Layers 3+4)                       +-----------+
  +-----------+
  | LAYER 2   |
  | INFERENCE |
  | vLLM      |  <- self-hosted open-weight models
  | Ollama    |     (Llama 4 / Qwen 3.5 / DeepSeek V4 / Gemma 4 / Mistral)
  +-----------+

  ----- LAYER 9: STANDARDS spoken across the stack -----
        MCP (tools/data)  .  A2A (agent-to-agent)

  ============================================================================
  | LAYER 0 — DATA ENGINEERING (the foundation everything sits on)            |
  |   ingestion . transformation . warehouse . governance                    |
  ============================================================================

B.3 — How to read this if you're the one who has to build it

Mark every box you already have. For most mid-market firms, Layer 0 (data) and Layer 6 (observability) are the two that are missing or half-built — and they're the two that decide whether the rest works.

Don't start by buying an orchestration framework. Start by putting LiteLLM in front of whatever model you're already calling (so you can cap spend and switch providers without rewriting code), and Langfuse around it (so you can see what's happening). That's two boxes, a weekend of work, and it pays for itself the first time a runaway loop would have run up a bill you couldn't see.

The standards in Layer 9 aren't optional homework, and neither is the durability in Layer 8. Put three questions in your vendor RFP: Is it MCP-compatible? Is it A2A-aware? Does it survive a crash? The first two test Layer 9 and the third tests Layer 8. The answers sort the systems from the demos faster than any feature list.
