AI Infrastructure by Vitale Mazo

Building a Memory-Driven AI Homelab: DGX Spark, Knowledge Graphs, and 20 Containers From Soup to Nuts

A surgical deep-dive into running an NVIDIA DGX Spark with K3s, multi-agent AI orchestration, three-layer persistent memory (QMD vector search, Graphiti knowledge graph, MuninnDB cognitive memory), and 20 Docker containers on Unraid — all wired together with MCP servers, HashiCorp Vault, Cloudflare Access, and a custom API layer.

#AI #Homelab #DGX Spark #Knowledge Graph #MCP #OpenClaw #Unraid #OPNsense #Claude #LLM #Infrastructure #Automation #FalkorDB #Vault #Auth0 #Cloudflare Access #QMD #MuninnDB #Vector Search #Embeddings

Homelab Architecture

Deep-dives into the evolving architecture of a memory-driven AI homelab

Part 1 of 3


Most homelabs stop at Plex and Pi-hole. This one runs an NVIDIA DGX Spark with a K3s cluster, a multi-agent AI system with four specialized sub-agents, three layers of persistent memory — a temporal knowledge graph, a cognitive memory database, and a workspace search engine running local GGUF models — 20 Docker containers on Unraid, an OPNsense firewall managing TLS termination for 16 services, and a secret management layer that would make an enterprise security team nod approvingly.

This post documents every layer of the architecture — physical hardware, network topology, container orchestration, AI agent routing, the three-tier memory system, and the MCP server mesh that ties it all together. No hand-waving. No “just deploy this Helm chart.” Every IP address, every config decision, every hack required to make Claude talk to a graph database through an OpenAI-compatible proxy.

The 30-Second Overview

System Architecture Overview: Three layers — Physical (OPNsense, Unraid Tower, DGX Spark with K3s), AI Agent (Agent-API, OpenClaw), and Memory + Data (Graphiti, TEI Embeddings, Vault, Cloudflare Access)

Three hosts. Twenty-one containers. Three cloud LLM providers. Three memory systems. Cloudflare Access protecting external services.


Part 1: The Hardware

NVIDIA DGX Spark — K3s Cluster and AI Compute

The DGX Spark (hostname: spanky1, IP: 10.0.128.196) is the compute backbone. It’s a Grace Blackwell GB10 with 128GB of unified memory running Ubuntu 24.04 on ARM64, now managed by a K3s cluster with ArgoCD GitOps (see Part 2 of this series for the full migration story).

The K3s cluster runs platform services — ArgoCD, Grafana, Prometheus, MetalLB, External Secrets Operator — with vLLM model deployments available as Kubernetes pods. The vLLM deployments for Qwen3-32B and Qwen2.5-7B are currently scaled to 0 replicas since the Agent-API has been simplified to use cloud-only LLM providers (Groq and OpenRouter), and OpenClaw runs entirely on Claude. The GPU is available for on-demand inference when needed — scaling up is a single kubectl scale or git commit.

Whisper remains as a systemd service handling speech-to-text on CPU with int8 quantization.

Unraid Tower — The Container Mothership

The Unraid NAS (tower.local.lan, 10.0.128.2) runs all 20 Docker containers across a br0 macvlan network. Every container gets its own IP on the 10.0.3.0/24 subnet, communicating directly at Layer 2 without NAT.

OPNsense — Firewall, DNS, TLS Termination

OPNsense (10.0.1.2) handles routing between subnets, Kea DHCPv4 for all leases, Unbound DNS for local resolution, WireGuard tunnels, and — critically — Caddy reverse proxy for TLS termination of all 16 internal services.

Every *.int.vitalemazo.com domain terminates TLS at Caddy on OPNsense using ACME certificates with Cloudflare DNS challenge. No self-signed certs. No certificate warnings.


Part 2: Network Architecture

Subnet Layout

┌─────────────────────────────────────────────┐
│              Network Topology                │
│                                              │
│  10.0.1.0/24   ─── OPNsense management      │
│  10.0.3.0/24   ─── br0 macvlan (containers) │
│  10.0.5.0/24   ─── IoT devices              │
│  10.0.128.0/24 ─── Compute (DGX Spark)      │
└─────────────────────────────────────────────┘

The br0 Macvlan — Every Container Is a First-Class Citizen

All 20 containers on Unraid share the br0 macvlan network. Each gets a unique 10.0.3.x IP address. This means:

  • Containers communicate directly at Layer 2 — no Docker bridge NAT
  • Each container is addressable by IP from anywhere on the network
  • OPNsense firewall rules do not apply to same-subnet L2 traffic
  • Security between containers is application-layer: bearer tokens, API keys, IP allowlists

This is a deliberate tradeoff. Macvlan gives clean networking and easy addressability at the cost of no implicit inter-container firewall. For a homelab where every service is authenticated, that’s acceptable.
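For reference, a setup like this can be sketched as a user-defined Docker macvlan network. This is a hypothetical sketch, not the actual Unraid configuration: the network name, gateway, and image name are assumptions, and Unraid manages br0 through its own UI rather than these commands.

```
# Create a macvlan network whose parent is the host bridge interface
docker network create -d macvlan \
  --subnet=10.0.3.0/24 \
  --gateway=10.0.3.1 \
  -o parent=br0 \
  br0-macvlan

# Attach a container with a fixed IP, as the IP assignment map assumes
docker run -d --network=br0-macvlan --ip=10.0.3.85 example/agent-api:latest
```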

IP Assignment Map

AI / Agent Stack                    Infrastructure
──────────────────                  ─────────────────────
10.0.3.85  Agent-API                10.0.3.75  Vault
10.0.3.88  Graphiti + FalkorDB      10.0.3.25  Home Assistant
10.0.3.89  TEI Embeddings           10.0.3.20  Mosquitto MQTT
10.0.3.90  CLI Proxy API            10.0.3.21  RYSE MQTT Bridge
10.0.3.91  MuninnDB                 10.0.3.30  Docker Registry
                                    10.0.3.31  Registry UI
                                    10.0.3.66  Cloudflared Tunnel
Media Stack
──────────────────                  K3s / Compute (DGX Spark)
10.0.3.13  Plex                     ─────────────────────
10.0.3.11  Sonarr                   10.0.128.196  K3s Node
10.0.3.10  Radarr                   10.0.128.200  ArgoCD
10.0.3.9   Prowlarr                 10.0.128.201  Grafana
10.0.3.8   Overseerr                10.0.128.203  OpenClaw
10.0.3.5   Deluge
10.0.3.12  FlareSolverr

Caddy Reverse Proxy — 16 Services, One Wildcard

Caddy on OPNsense terminates TLS for every internal service:

vault.int.vitalemazo.com       → 10.0.3.75:8200
ha.int.vitalemazo.com          → 10.0.3.25:8123
plex.int.vitalemazo.com        → 10.0.3.13:32400
agent.int.vitalemazo.com       → 10.0.3.85:8888
openclaw.int.vitalemazo.com    → 10.0.128.203:18789 (K3s)
argo.int.vitalemazo.com        → 10.0.128.200 (ArgoCD on K3s)
grafana-spark.int.vitalemazo.com → 10.0.128.201 (Grafana on K3s)
...and 10 more
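A hedged sketch of the equivalent Caddyfile, using a single wildcard site with host matchers. The actual OPNsense Caddy plugin generates its own configuration; the hostnames follow the map above, and the Cloudflare token environment variable is an assumption.

```
*.int.vitalemazo.com {
	tls {
		dns cloudflare {env.CF_API_TOKEN}
	}

	@vault host vault.int.vitalemazo.com
	handle @vault {
		reverse_proxy 10.0.3.75:8200
	}

	@agent host agent.int.vitalemazo.com
	handle @agent {
		reverse_proxy 10.0.3.85:8888
	}
}
```

One wildcard certificate via the DNS-01 challenge covers every subdomain, so no per-service ACME issuance is needed.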

External access to sensitive services (registry, ArgoCD, Grafana, DGX dashboard) goes through Cloudflare Access with Auth0 SSO. Internal LAN access terminates TLS at Caddy and relies on application-layer authentication.


Part 3: External Access — Cloudflare Access + Auth0 SSO

Early iterations of this stack used an API gateway (Tyk OSS) to consolidate routing, auth header injection, and protocol translation. As the architecture matured, that complexity proved unnecessary — Cloudflare Access handles authentication at the edge, and services are accessed directly through the Cloudflare Tunnel with per-hostname Access Applications.

The Simplified Architecture

External User → Cloudflare Tunnel → Cloudflare Access (Auth0 SSO)
  → Direct to backend service (no gateway intermediary)

Each externally-exposed service gets its own Cloudflare Access Application with Auth0 as the identity provider. Users authenticate once through Auth0’s login page (Google social connection), and Cloudflare issues a 24-hour session token. Only authenticated requests reach the backend.

Access-Protected Services

| Service | External URL | Backend | Auth |
| --- | --- | --- | --- |
| Docker Registry API | registry.vitalemazo.com | 10.0.3.30:5000 | Cloudflare Access + Auth0 |
| Docker Registry UI | registry-ui.vitalemazo.com | 10.0.3.31:80 | Cloudflare Access + Auth0 |
| ArgoCD | argo.vitalemazo.com | 10.0.128.200 | Cloudflare Access + Auth0 |
| Grafana | grafana-spark.vitalemazo.com | 10.0.128.201 | Cloudflare Access + Auth0 |
| Traefik Dashboard | traefik-spark.vitalemazo.com | 10.0.128.202 | Cloudflare Access + Auth0 |
| DGX Dashboard | dgx.vitalemazo.com | 10.0.128.196:11001 | Cloudflare Access + Auth0 |
| OpenClaw | openclaw.vitalemazo.com | 10.0.128.203:18789 (K3s) | Cloudflare Access + Auth0 |

What Got Removed

The API gateway layer (Tyk OSS + Redis) was removed entirely. Three containers eliminated:

| Removed Container | What Replaced It |
| --- | --- |
| Tyk Gateway (10.0.3.40) | Cloudflare Access per-hostname auth + direct tunnel routing |
| Tyk Redis (10.0.3.41) | No session storage needed — Cloudflare manages sessions at the edge |
| Agent-Chat Web UI (10.0.3.86) | OpenClaw is now the sole chat interface |

Registry authentication previously required basic auth header injection via the gateway. Now the registry runs without htpasswd — Cloudflare Access ensures only authenticated users reach it. The registry UI gets its own Access Application so both registry.vitalemazo.com and registry-ui.vitalemazo.com are independently protected.

Internal Access — OPNsense Caddy

For LAN access, nothing changed. OPNsense’s Caddy reverse proxy terminates TLS for every *.int.vitalemazo.com service using ACME certificates with Cloudflare DNS challenge. Internal access doesn’t require Auth0 — services handle their own application-layer authentication.


Part 4: The Multi-Agent AI System

Agent-API — The Brain Router

The Agent-API (10.0.3.85:8888) is a custom Python application built on PydanticAI that routes every user query to the right specialist.

Multi-Agent Router: User query flows to keyword-based classifier which dispatches to four specialized sub-agents — Infrastructure (Groq Scout, 25 tools), Home (Groq, 8 tools), GitHub (GPT-OSS, MCP Server), and General (GPT-OSS, 10 tools)

The router is keyword-based. Earlier versions used a local Qwen2.5-7B model for intent classification, but since the Agent-API no longer depends on local LLMs, the router now uses simple keyword matching — pattern rules that classify queries like “turn on the lights” to the Home agent and “what’s running on tower” to Infrastructure. This eliminates the vLLM dependency entirely and means the Agent-API starts instantly with zero GPU requirements.
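A minimal sketch of such a keyword router. The pattern lists here are illustrative, not the production rules:

```python
# Keyword-based intent router: first matching pattern wins, unmatched
# queries fall through to the General agent.
import re

ROUTES = [
    ("home", [r"\blight(s)?\b", r"\bshade(s)?\b", r"\bthermostat\b", r"\bturn (on|off)\b"]),
    ("infrastructure", [r"\btower\b", r"\bcontainer(s)?\b", r"\bfirewall\b", r"\brunning on\b"]),
    ("github", [r"\b(repo|issue|pull request|pr)\b"]),
]

def route(query: str) -> str:
    q = query.lower()
    for agent, patterns in ROUTES:
        if any(re.search(p, q) for p in patterns):
            return agent
    return "general"  # default agent when nothing matches
```

No model call, no GPU, and routing cost is effectively zero; the tradeoff is that ambiguous queries land on the General agent.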

Each sub-agent has a cloud-only fallback chain. If the primary model is unreachable or returns an error, the agent automatically retries with the next provider:

| Agent | Primary | Fallback | Tools |
| --- | --- | --- | --- |
| Infrastructure | Groq (Llama 4 Scout) | GPT-OSS-120B (OpenRouter) | SSH, OPNsense API (10 tools), Terraform, Docker Registry, Cloudflare DNS |
| Home | Groq (Llama 4 Scout) | GPT-OSS-120B | HA entity control, state queries, automations, history |
| GitHub | GPT-OSS-120B | Groq | GitHub MCP Server (repos, issues, PRs) — 131K context for large diffs |
| General | GPT-OSS-120B | Groq | Time, weather, ping, news, web search, Vault secrets |
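The fallback behavior reduces to a few lines. In this sketch, `call` is a stand-in for the real provider client:

```python
# Try each provider in order; any exception (unreachable host, HTTP error,
# malformed response) advances to the next one.
def complete_with_fallback(providers, call, prompt):
    last_err = None
    for provider in providers:
        try:
            return call(provider, prompt)
        except Exception as err:
            last_err = err
    raise RuntimeError(f"all providers failed: {last_err}")
```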

The Home agent has a regex fast path. Simple commands like “turn on the kitchen light” or “close the shades” bypass the LLM entirely — a regex parser extracts the action and entity, calls Home Assistant directly, and returns in under 500ms. The LLM only activates for complex queries like “which lights have been on for more than 2 hours?”
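A hedged sketch of that fast path. The regex and return shape are illustrative; the production parser covers more verbs and maps the entity onto Home Assistant IDs:

```python
# Parse simple device commands without an LLM; return None to fall
# through to the model for anything more complex.
import re

FAST_PATH = re.compile(
    r"^(turn (?P<action>on|off)|(?P<close>close)|(?P<open>open)) the (?P<entity>[\w ]+)$",
    re.IGNORECASE,
)

def parse_fast_path(text):
    m = FAST_PATH.match(text.strip().rstrip("."))
    if not m:
        return None  # complex query: hand off to the LLM
    if m.group("action"):
        action = "turn_" + m.group("action").lower()
    elif m.group("close"):
        action = "close"
    else:
        action = "open"
    return action, m.group("entity").lower()
```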

Why Cloud-Only for Agent-API?

The original design used local Qwen3-32B for Home and General agents, with cloud providers as fallbacks. In practice, OpenClaw (running Claude) handles all conversational and complex tasks. The Agent-API primarily handles structured automation — tool calls triggered by OpenClaw’s homelab-bridge skill, cron-driven tasks, and direct API calls. For these structured tasks, Groq’s Llama 4 Scout and OpenRouter’s GPT-OSS-120B provide excellent quality with sub-5-second latency, no GPU memory consumed, and instant startup.

The DGX Spark’s GPU is now free for on-demand inference workloads via K3s rather than being permanently allocated to always-on Agent-API models.

The Monkey-Patches

When you wire together models from Groq and OpenRouter through the OpenAI SDK, you hit compatibility issues:

  1. Groq returns service_tier: "on_demand" in chat completions. The OpenAI SDK’s Pydantic model rejects this. Fix: patch ChatCompletion.model_fields["service_tier"] to accept the value.

  2. Groq sends null tool arguments. GPT-OSS sends {"": {}} for parameterless tools. Neither is valid per the OpenAI spec. Fix: patch ToolManager._validate_tool_args to normalize both patterns.

These are two lines of monkey-patching that save hundreds of error-handling branches.
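The normalization half of the second patch, sketched as a standalone function. The real patch wires this into the SDK's ToolManager; this shows only the mapping logic:

```python
# Normalize non-spec tool-call arguments from different providers to a
# plain dict before schema validation.
import json

def normalize_tool_args(raw):
    """Map Groq's null args and GPT-OSS's {"": {}} marker to an empty dict."""
    if raw is None:                      # Groq: arguments == null
        return {}
    if isinstance(raw, str):
        raw = json.loads(raw) if raw.strip() else {}
    if raw == {"": {}}:                  # GPT-OSS: parameterless-tool marker
        return {}
    return raw
```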

Authentication and Rate Limiting

Every Agent-API endpoint (except /api/health) requires a bearer token. Tokens are stored in HashiCorp Vault at secret/agent-api/keys — two keys: personal (for direct API access) and openclaw (for the OpenClaw platform).

Rate limiting: 30 requests/minute per key, maximum 2 concurrent requests per key. Sessions expire after 2 hours or 20 messages per agent history.
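The stated per-key limit amounts to a sliding window; a sketch (the production limiter may be implemented differently):

```python
# Sliding-window rate limiter: allow at most MAX_REQUESTS hits per key
# in any WINDOW_SECONDS span.
from collections import defaultdict, deque
import time

WINDOW_SECONDS = 60
MAX_REQUESTS = 30

_hits = defaultdict(deque)

def allow(api_key, now=None):
    now = time.monotonic() if now is None else now
    window = _hits[api_key]
    while window and now - window[0] >= WINDOW_SECONDS:
        window.popleft()                 # expire hits older than the window
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```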


Part 5: OpenClaw — The Agent Platform

OpenClaw (10.0.128.203, K3s) is the user-facing platform and the primary chat interface for the entire homelab. It provides a web chat UI, agent lifecycle management, skill systems, cron-driven autonomous behaviors, and a three-layer memory system that gives agents persistent recall across sessions. OpenClaw is entirely Claude-powered — every conversation, every tool call, every reasoning step runs through Anthropic’s Claude API via an OpenAI-compatible proxy. External access is protected by Cloudflare Access with Auth0 SSO.

OpenClaw Platform: Browser connects through Cloudflare Access and Caddy TLS to OpenClaw Gateway hosting Sparky (Claude Sonnet 4.6, 16 skills) and Dev (Claude Opus 4.6, 9 skills), both routing through cli-proxy-api to Anthropic Claude API

Three Agents, Different Roles

Sparky is the home and infrastructure assistant. It has 16 skills covering everything from controlling Sonos speakers and RYSE window shades to querying OPNsense firewall rules and managing Unraid containers. It runs on Claude Sonnet 4.6 for a balance of speed and capability, and has a heartbeat that triggers every 30 minutes during waking hours (8am–11pm) for proactive monitoring.

Dev is the software development agent. It runs on Claude Opus 4.6 for maximum reasoning capability and has 9 skills covering autonomous coding loops (dev-loop), project bootstrapping, Excalidraw diagram generation, knowledge graph access, GitHub integration, and Vault secrets management. Sandbox is completely off — it has full read, write, edit, and exec access to its workspace.

DevOps is the infrastructure automation agent. Also running on Claude Opus 4.6, it has full tool access and executes commands on remote nodes — including the Mac workstation via OpenClaw’s WebSocket node pairing. It handles deployments, container management, CI/CD pipelines, and infrastructure-as-code operations. Its workspace is isolated from both Sparky and Dev to prevent cross-contamination of operational and development contexts.

The Skill System

Skills are markdown files (SKILL.md) that teach agents how to use specific tools. Sparky’s 16 workspace skills:

  • homelab-bridge: Proxies requests to the Agent-API for infrastructure/HA/GitHub operations
  • knowledge-graph: Stores and retrieves facts from the Graphiti temporal knowledge graph
  • opnsense: Queries the OPNsense REST API for firewall rules and DHCP leases
  • ryse-shades: Controls RYSE SmartBridge window shades (with the workaround that close_cover doesn’t work — only set_cover_position to 0)
  • vault-secrets: CRUD operations on HashiCorp Vault secrets
  • sonoscli: Speaker control (play, pause, volume, grouping)
  • proactive-agent: Autonomous behavior triggered by cron heartbeats
  • self-improving-agent: Learns from errors and corrections to improve future responses
  • caldav-calendar: CalDAV calendar integration for scheduling
  • excalidraw: Architecture diagram generation as .excalidraw files
  • unraid: Docker container management on Unraid
  • weather: Weather queries and forecasts
  • web-search: Internet search capabilities
  • muninn-memory: MuninnDB cognitive memory — remember, recall, and reason over past experiences
  • find-skills: Discovers and loads additional skills from the global skills directory
  • github: GitHub repository operations

There are also 11 global skills shared across the agents covering Terraform, Kubernetes, Docker, and development patterns.

The Claude Proxy — Why OpenClaw Doesn’t Use Local LLMs

This is a common question: why doesn’t OpenClaw use local LLMs?

The Agent-API uses cloud providers (Groq Llama 4 Scout, OpenRouter GPT-OSS-120B) with a keyword-based router — no local models at all. Its sub-agents handle structured tasks (classify intent, call tool, return result) that fast cloud models handle well.

OpenClaw is different. It’s a full conversational AI platform with compaction, memory flush, multi-turn reasoning, and skill orchestration. These capabilities demand Claude-class reasoning. Both agents talk to Claude through cli-proxy-api at 10.0.3.90:8317 — an OpenAI-compatible proxy that translates requests from OpenAI’s API format to Anthropic’s native format and forwards them to Claude’s cloud API.

// OpenClaw model provider config (from openclaw.json)
{
  "providers": {
    "claude-proxy": {
      "baseUrl": "http://10.0.3.90:8317/v1",
      "api": "openai-completions",
      "models": [
        { "id": "claude-sonnet-4-6", "contextWindow": 200000, "maxTokens": 16384 },
        { "id": "claude-opus-4-6", "contextWindow": 200000, "maxTokens": 16384 }
      ]
    }
  }
}

The proxy (cli-proxy-api) is a lightweight Anthropic→OpenAI protocol translator running in its own container. OpenClaw sends requests as OpenAI-compatible chat completions; the proxy rewrites them as Anthropic messages API calls and streams the response back. No API key is shared with OpenClaw — the proxy holds the Anthropic credentials.


Part 6: The Memory Architecture — Three Layers of Persistent Recall

This is where it gets interesting. Most AI agents are stateless — every conversation starts from zero. This homelab gives agents three complementary memory systems that each solve a different recall problem. Together they provide workspace-level document search, structured knowledge with temporal relationships, and associative cognitive recall — all searchable in under 3 seconds.

Memory Architecture: Three layers — Workspace Memory (QMD with BM25, vector, and LLM reranker), Knowledge Graph (Graphiti with FalkorDB), and Cognitive Memory (MuninnDB with associative recall) — sharing TEI embeddings and cli-proxy-api infrastructure

Layer 1: Workspace Memory — QMD Search Backend

Every OpenClaw agent has a workspace full of markdown files: MEMORY.md, daily logs (memory/2026-03-13.md), session transcripts, and skill definitions. The question is: how do you search them effectively?

The built-in search was basic — keyword matching against filenames. QMD (Query Markup Documents) replaces it with a full retrieval pipeline running entirely on local GGUF models inside the container. Zero API calls. Zero cost per search.

QMD Search Pipeline: Query flows through Query Expansion (1.7B GGUF), then parallel BM25 keyword search and Vector search (300M GGUF), candidates merge, LLM Reranker (0.6B GGUF) scores relevance, MMR ensures diversity, temporal decay weights recency, top 6 results injected into agent context

Here’s what happens when an agent searches memory:

  1. Query Expansion: A 1.7B parameter GGUF model (qmd-query-expansion) decomposes the query into sub-queries. “DGX Spark network config” might expand to: “DGX Spark IP address”, “network routes compute subnet”, “vLLM configuration”
  2. Parallel Retrieval: Two search engines run simultaneously:
    • BM25 (SQLite FTS5) — keyword matching that catches exact values like IP addresses, hostnames, and config keys
    • Vector Search (embedding-gemma-300M GGUF) — semantic similarity for conceptual matches
  3. Candidate Fusion: Results from both paths are merged with a 4x candidate multiplier — retrieve 24 candidates to select the best 6
  4. LLM Reranker (qwen3-reranker-0.6B, Q8_0 GGUF) — scores each candidate for relevance and reorders by quality
  5. MMR Diversity (lambda 0.7) — prevents returning 6 near-identical chunks from the same document
  6. Temporal Decay (30-day half-life) — recent memories rank higher than stale ones
  7. Context Injection — top 6 results, capped at 5,000 characters, injected into the agent’s prompt
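Steps 5 and 6 follow directly from the stated parameters (lambda 0.7, 30-day half-life). In this sketch, `relevance` and `similarity` are stand-ins for the reranker score and embedding cosine similarity:

```python
# Greedy MMR selection plus exponential temporal decay.
HALF_LIFE_DAYS = 30
MMR_LAMBDA = 0.7

def temporal_weight(age_days):
    # 1.0 when fresh, 0.5 at 30 days, 0.25 at 60 days, ...
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def mmr_select(candidates, relevance, similarity, k=6):
    """Pick k results trading relevance against redundancy with picks so far."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return MMR_LAMBDA * relevance(c) - (1 - MMR_LAMBDA) * redundancy
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
    return selected
```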

The three GGUF models total ~2.1GB in RAM:

| Model | Size | Purpose |
| --- | --- | --- |
| embedding-gemma-300M | ~400MB | Vector embeddings for semantic search |
| qwen3-reranker-0.6B | ~640MB | Cross-encoder relevance scoring |
| qmd-query-expansion-1.7B | ~1.2GB | Query decomposition into sub-queries |

QMD runs as an MCP HTTP daemon on port 8181 inside the OpenClaw container, started by an init wrapper script that handles installation, lifecycle management, and a watchdog that auto-restarts the daemon if it crashes. The daemon avoids the 15–19 second cold-start penalty that would otherwise hit every query — with the daemon running, searches complete in 1–3 seconds.

The fallback chain: If QMD is unavailable, OpenClaw falls back to a built-in hybrid search that combines BM25 with vector similarity via the TEI embeddings server (10.0.3.89:8080). This provides most of the retrieval quality (minus the reranker and query expansion) with zero local model dependencies.

Memory Flush — Pre-Compaction Persistence

OpenClaw agents have a 200K token context window, but long sessions eventually trigger compaction — the system compresses older messages to free up space. Without intervention, valuable context gets lost.

The memory flush system intercepts this:

{
  "compaction": {
    "mode": "safeguard",
    "reserveTokensFloor": 24000,
    "memoryFlush": {
      "enabled": true,
      "softThresholdTokens": 6000,
      "systemPrompt": "Session nearing compaction. Store durable memories now.",
      "prompt": "Write any lasting notes, decisions, or discovered facts to memory/YYYY-MM-DD.md."
    }
  }
}

When a session reaches ~170K tokens (200K minus the 24K reserve minus the 6K soft threshold), the agent receives a system prompt telling it to save important context to disk before compaction erases it. These saved notes become searchable by QMD in the next sync cycle (every 5 minutes).
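The trigger point falls out of the config arithmetically:

```python
# Flush trigger implied by the compaction settings
CONTEXT_WINDOW = 200_000
RESERVE_FLOOR = 24_000      # reserveTokensFloor
SOFT_THRESHOLD = 6_000      # softThresholdTokens

flush_at = CONTEXT_WINDOW - RESERVE_FLOOR - SOFT_THRESHOLD  # tokens
```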

Session Indexing — Past Conversations Are Searchable

Every past conversation transcript is indexed by QMD. An agent can recall what was discussed three weeks ago — “what did we decide about the MetalLB IP pool?” — because the session transcripts are part of the search corpus. Session retention is set to 90 days.

Layer 2: Knowledge Graph — Graphiti + FalkorDB

While QMD searches documents, Graphiti extracts and stores structured knowledge: entities, relationships, and temporal facts.

Knowledge Graph Architecture: Agent calls graphiti-cli which connects via MCP protocol to Graphiti MCP Server containing Episode Ingestion, Entity Extraction (Claude Sonnet 4.6), Semantic Search with TEI Embeddings (10.0.3.89), and FalkorDB graph storage

When an agent learns something important — a deployment outcome, a user preference, an infrastructure fact — it calls graphiti-cli add with a text description and a group ID.

graphiti-cli add "Deployed Graphiti at 10.0.3.88 with FalkorDB \
  and TEI embeddings on March 6th 2026" infra

Here’s what happens in the next ~15 seconds:

  1. Episode creation: The text is stored as an episode in FalkorDB with a timestamp and group ID
  2. Entity extraction: Claude Sonnet 4.6 analyzes the text and extracts entities with types:
    • Graphiti → Organization
    • FalkorDB → Organization
    • 10.0.3.88 → Location
    • TEI embeddings → Topic
    • March 6th 2026 → Event
  3. Relationship extraction: Claude identifies relationships between entities:
    • Graphiti —deployed_at→ 10.0.3.88
    • Graphiti —uses→ FalkorDB
    • Graphiti —uses→ TEI embeddings
  4. Embedding generation: Each entity and relationship gets a 384-dimensional vector from the TEI server
  5. Graph storage: Nodes, edges, and vectors are persisted in FalkorDB

When an agent needs to recall information:

graphiti-cli search-facts "what database does Graphiti use" infra

This performs both semantic search (vector similarity via TEI embeddings) and graph traversal (following relationships in FalkorDB) to return relevant facts with temporal context.

Entity Types

The knowledge graph automatically categorizes extracted entities:

| Type | Description | Examples |
| --- | --- | --- |
| Preference | User choices and opinions | "Prefers dark mode", "Uses keyword router for Agent-API" |
| Requirement | Needs and specs | "Must support 200K context", "Needs FP8 quantization" |
| Procedure | Workflows and commands | "Delete wlan0 route after reboot", "Deploy with docker run" |
| Location | Physical and network locations | "10.0.3.88", "tower", "DGX Spark" |
| Event | Deployments, changes, incidents | "Deployed March 6th", "Fixed embedder base_url" |
| Organization | Services and systems | "FalkorDB", "OpenClaw", "Graphiti" |
| Document | Files and configs | "config.yaml", "deploy.sh", "SOUL.md" |
| Topic | Concepts and technologies | "Temporal knowledge graph", "macvlan networking" |

Group IDs — Cross-Agent Memory

All three agents read and write to the same graph but tag episodes with different group IDs:

  • sparky — Sparky’s observations and decisions
  • dev — Dev’s coding context and project knowledge
  • devops — DevOps deployment and infrastructure knowledge
  • infra — Shared infrastructure facts

This means Dev can recall what Sparky learned about a network issue, DevOps can reference code decisions Dev made, and Sparky can look up what DevOps deployed last Tuesday. The knowledge graph is shared; the group IDs provide attribution and scoping for search.

The Patches That Made It Work

Graphiti’s MCP server is designed for native OpenAI APIs. Making it work with Claude through an OpenAI-compatible proxy required patching three Python files.

Problem 1: Embeddings routing. Graphiti uses the OpenAI SDK for embeddings, which picks up the OPENAI_BASE_URL environment variable. That points at the Claude proxy (10.0.3.90:8317), but embeddings need to go to the TEI server (10.0.3.89:8080). The factory code doesn’t pass base_url separately.

Fix: Patched factories.py to extract api_url from the embedder’s provider config and pass it explicitly to OpenAIEmbedderConfig(base_url=...).

Problem 2: Structured output validation. Graphiti uses OpenAI’s responses.parse() for structured output — schema validation happens inside the SDK before our code runs. Claude returns JSON wrapped in markdown code fences (```json ... ```), wrong field names (entities instead of extracted_entities), and bare lists instead of objects. All of these fail SDK validation.

Fix: Rewrote openai_client.py to use chat.completions.create() instead of responses.parse(). The JSON schema gets injected as text in the system prompt. A custom response parser strips code fences, remaps field names using fuzzy matching, and auto-wraps bare lists into the expected object structure by inspecting the Pydantic response model’s field types.
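The parsing side of that fix can be sketched as follows. This is a simplified stand-in for the patched openai_client.py: it takes an explicit field list, whereas the real version inspects the Pydantic response model:

```python
# Repair Claude's "almost JSON" structured output: strip code fences,
# fuzzy-remap field names, and wrap bare lists into the expected object.
import difflib
import json
import re

def parse_structured(text, expected_fields):
    # 1. Strip markdown code fences such as ```json ... ```
    m = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    payload = json.loads(m.group(1) if m else text)

    # 2. Bare list -> wrap into the first expected field
    if isinstance(payload, list):
        return {expected_fields[0]: payload}

    # 3. Remap near-miss keys ("entities" -> "extracted_entities")
    fixed = {}
    for key, value in payload.items():
        match = difflib.get_close_matches(key, expected_fields, n=1, cutoff=0.5)
        fixed[match[0] if match else key] = value
    return fixed
```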

Problem 3: Small model fallback. Graphiti uses a “small model” (defaulting to gpt-4.1-mini) for lightweight operations. The Claude proxy doesn’t serve that model.

Fix: Patched factories.py to detect non-OpenAI model names and set small_model = config.model — use Claude for everything.

These three patched files are bind-mounted into the container, overriding the originals at runtime.

Layer 3: Cognitive Memory — MuninnDB

While QMD searches files and Graphiti stores structured facts, MuninnDB (10.0.3.91) provides associative cognitive memory — the kind of recall that mimics how humans connect ideas.

MuninnDB is a custom memory database built from a Rust binary with 33 MCP tools. Each agent has its own vault (namespace) — sparky, dev, devops, infra — with separate API keys stored in HashiCorp Vault.

The key operations:

| Tool | Purpose |
| --- | --- |
| muninn_remember | Store a memory with automatic embedding and LLM enrichment |
| muninn_recall | Retrieve memories by associative similarity |
| muninn_decide | Ask MuninnDB to reason over stored memories and make a recommendation |
| muninn_traverse | Walk the memory graph following conceptual connections |
| muninn_remember_tree | Store a hierarchical memory structure |

When an agent calls muninn_remember, the text is:

  1. Embedded via TEI (10.0.3.89:8080) for vector search
  2. Enriched by Claude Sonnet 4.6 (via cli-proxy-api at 10.0.3.90:8317) — the LLM adds metadata, tags, and conceptual connections
  3. Stored in MuninnDB’s internal graph with bidirectional associations

The muninn_decide tool is unique — you ask it a question like “should we use Qwen3 or Claude for this task?” and it reasons over all relevant stored memories to produce a recommendation. This is cognitive recall, not just search.

MuninnDB runs on a custom Debian container (the binary is glibc-linked — Alpine doesn’t work). A socat layer forwards traffic from the container IP to the 127.0.0.1-bound binary:

Container IP (10.0.3.91)         Internal
─────────────────────            ─────────
:8475 (REST)   → socat →  127.0.0.1:8474
:8476 (Web UI) → socat →  127.0.0.1:8476
:8750 (MCP)    → native   0.0.0.0:8750
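The forwarders amount to something like the following. The exact flags are assumptions; the actual container entrypoint may differ:

```
# REST: listen on the container IP's port 8475, forward to loopback 8474
socat TCP-LISTEN:8475,fork,reuseaddr TCP:127.0.0.1:8474 &

# Web UI: same port on both sides, so bind explicitly to the container IP
# to avoid colliding with the loopback-bound binary
socat TCP-LISTEN:8476,fork,reuseaddr,bind=10.0.3.91 TCP:127.0.0.1:8476 &
```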

How the Three Layers Work Together

Each memory layer answers a different question:

| Question | Layer | Example |
| --- | --- | --- |
| "What's in my notes about X?" | QMD (workspace) | "What did I write about the MetalLB IP pool config?" |
| "What are the facts about X?" | Graphiti (knowledge) | "What IP is Graphiti deployed on?" |
| "What should I do about X?" | MuninnDB (cognitive) | "Based on past deployments, should I use rolling or blue-green?" |

They don’t compete — they complement. An agent might:

  1. QMD finds the session transcript where you discussed DNS configuration
  2. Graphiti retrieves the structured fact that OPNsense runs Unbound DNS at 10.0.1.2
  3. MuninnDB recalls that the last time someone changed DNS config without testing, resolution broke for 2 hours

The workspace memory runs automatically on every message (injected into the prompt). The knowledge graph and cognitive memory are invoked explicitly by the agent via skill-defined tools when it needs structured facts or associative reasoning.


Part 7: Secret Management with HashiCorp Vault

Every API key, token, and credential in this infrastructure lives in HashiCorp Vault (10.0.3.75).

Vault Architecture: HashiCorp Vault with AppRole auth stores secret paths for API keys, agent auth, HA tokens, Cloudflare, OpenClaw, and AWS credentials — consumed by Agent-API (auto-refresh), OpenClaw (scoped read-only 15m tokens), and Claude Code (full access)

No Hardcoded Secrets

The Agent-API authenticates to Vault using AppRole with automatic token refresh. At startup, it exchanges a Role ID and Secret ID for a renewable token (1-hour TTL, extendable to 4 hours). Every API key — Groq, OpenRouter, GitHub, Home Assistant, OPNsense, Cloudflare — is fetched from Vault at runtime.
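The AppRole exchange uses Vault's standard API endpoints; sketched here with curl, with the role and secret IDs as placeholders:

```
export VAULT_ADDR=https://vault.int.vitalemazo.com

# Exchange Role ID + Secret ID for a renewable client token
VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/approle/login" \
  -d '{"role_id": "<role-id>", "secret_id": "<secret-id>"}' \
  | jq -r .auth.client_token)

# Renew before the 1-hour TTL expires (extendable up to the 4-hour max)
curl -s -X POST "$VAULT_ADDR/v1/auth/token/renew-self" \
  -H "X-Vault-Token: $VAULT_TOKEN"
```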

OpenClaw gets scoped access through a special endpoint (/api/internal/token) on the Agent-API that mints short-lived Vault tokens with a readonly policy and 15-minute TTL. This endpoint is IP-restricted to OpenClaw’s K3s pod network.

Vault MCP Server

Claude Code (my local CLI) connects to Vault through an MCP server — a Go binary that provides read_secret, write_secret, list_secrets, and delete_secret tools, plus full PKI certificate management. This means I can say “store this API key in Vault” in a Claude Code session, and it happens without me ever touching the Vault UI.


Part 8: Home Automation Integration

Home Assistant + MQTT + RYSE Shades

Home Automation Flow: Agent-API Home Agent connects via REST to Home Assistant, which communicates through Mosquitto MQTT Broker to RYSE MQTT Bridge and finally to the physical RYSE SmartBridge on the IoT subnet

The Home Agent has 8 tools for interacting with Home Assistant via its REST API. The standout is ha_control — a combined find-and-control tool that uses fuzzy entity matching with difflib.SequenceMatcher. You can say “turn on the kitchen light” even if the entity is named light.kitchen_main_overhead — it’ll find the closest match.
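The fuzzy match can be sketched with difflib directly; the entity list here is illustrative:

```python
# Pick the Home Assistant entity whose ID is closest to the spoken name.
from difflib import SequenceMatcher

ENTITIES = [
    "light.kitchen_main_overhead",
    "light.living_room_lamp",
    "cover.bedroom_shades",
]

def best_entity(spoken, entities=ENTITIES):
    target = spoken.lower().replace(" ", "_")
    return max(entities, key=lambda e: SequenceMatcher(None, target, e).ratio())
```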

The RYSE SmartBridge integration deserves special mention. The bridge controls motorized window shades but has a quirk: the standard close_cover service doesn’t work. The agent has learned (and stored in the knowledge graph) that only set_cover_position with position 0 reliably closes the shades. This is exactly the kind of operational knowledge that the temporal knowledge graph preserves across sessions.
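In practice the workaround amounts to always calling `cover/set_cover_position` with position 0. The helper below just builds the Home Assistant REST call (`/api/services/<domain>/<service>` is HA's standard service endpoint; the base URL is assumed):

```python
HA_URL = "http://homeassistant.local:8123"  # assumed address

def close_shade_request(entity_id: str):
    """Build the REST call that reliably closes a RYSE shade:
    set_cover_position with position 0 instead of close_cover."""
    return (
        f"{HA_URL}/api/services/cover/set_cover_position",
        {"entity_id": entity_id, "position": 0},
    )
```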


Part 9: MCP Servers — The Connective Tissue

Model Context Protocol (MCP) servers provide tool interfaces that AI agents can discover and use. Seven MCP servers are configured across the system:

| Server | Runtime | Purpose |
|---|---|---|
| Vault | Go binary | Secret CRUD, PKI certificate management |
| SSH | Native binary | Remote command execution on known hosts |
| Browser | Native binary | Web page interaction and automation |
| GitHub | Stdio (in Agent-API) | Repository, issue, and PR management |
| Graphiti | HTTP (10.0.3.88:8000) | Knowledge graph read/write via MCP protocol |
| MuninnDB | HTTP (10.0.3.91:8750) | Cognitive memory — 33 tools including remember, recall, decide, traverse |
| QMD | HTTP (localhost:8181) | Workspace memory search — BM25 + vector + reranker pipeline |

OPNsense management is handled via SSH rather than a dedicated MCP server — the OPNsense REST API auth proved unreliable, so direct SSH with key-based authentication is the production approach. Sparky’s opnsense skill wraps SSH commands to query firewall rules, DHCP leases, and configuration.

MCP Transport: Stdio vs HTTP

Most MCP servers use stdio transport — they run as child processes that communicate over stdin/stdout. This is fine for single-client use (Claude Code on my Mac).

Graphiti uses Streamable HTTP transport — it’s a network service at 10.0.3.88:8000/mcp that multiple clients can connect to simultaneously. The graphiti-cli shell script handles the MCP session lifecycle: initialize a session (get a session ID from the response headers), call tools with that session ID, parse JSON-RPC responses.

# Simplified graphiti-cli flow
SESSION_ID=$(curl -si -X POST "$URL" \
  -d '{"jsonrpc":"2.0","method":"initialize",...}' \
  | grep -i "mcp-session-id:" | sed "s/^[^:]*: *//" | tr -d "\r\n")

curl -X POST "$URL" \
  -H "mcp-session-id: $SESSION_ID" \
  -d '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"add_episode",...}}'

Part 10: The Complete Data Flow

Here’s what happens when you type “Remember that the DGX Spark runs Qwen3-32B at 10.0.128.196” into the OpenClaw chat:

Complete Data Flow: Memory storage path from Browser through Cloudflare Access, Caddy, OpenClaw Gateway, Sparky Agent, graphiti-cli, Graphiti MCP Server, entity extraction via Claude, embedding via TEI, to FalkorDB storage — and recall path from query through semantic search and graph traversal back to the agent

Storage path (write): Browser → Cloudflare Access → Caddy (TLS) → OpenClaw Gateway → Sparky Agent → graphiti-cli → Graphiti MCP Server → Entity extraction (Claude Sonnet 4.6) + Embedding (TEI) → FalkorDB. ~15 seconds for entity extraction and storage.

Recall path (read): Browser → Cloudflare Access → Caddy → OpenClaw → Sparky → graphiti-cli → Graphiti MCP → TEI (embed query) → FalkorDB (vector similarity + graph traversal) → facts returned to Sparky. Under 3 seconds.

The round-trip for recall is under 3 seconds. Storage takes ~15 seconds due to the entity extraction LLM calls.

What’s less visible is what happens on every single message. Before the agent even sees your query, QMD runs a memory search against the workspace:

Your message arrives
  ├── QMD search triggers (automatic, every message)
  │   ├── Query expansion → 3 sub-queries
  │   ├── BM25 + Vector search → 24 candidates
  │   ├── Reranker → top 6 results
  │   └── 5,000 chars injected into prompt

  ├── Agent receives: your message + memory context
  │   ├── May invoke Graphiti (explicit): "what IP is X on?"
  │   ├── May invoke MuninnDB (explicit): "recall past decisions about X"
  │   └── Responds with full context from all layers

  └── If near compaction threshold:
      └── Memory flush → saves durable notes to disk
          └── QMD indexes them within 5 minutes

The workspace memory is passive — it enriches every interaction automatically. The knowledge graph and cognitive memory are active — the agent calls them when it needs structured facts or associative reasoning. This layered approach means the agent always has relevant workspace context, and can pull in deeper knowledge on demand.
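The BM25 + vector merge step can be illustrated with reciprocal-rank fusion — a common way to combine lexical and semantic rankings, though QMD's exact fusion method isn't specified here:

```python
def rrf_fuse(bm25_ranking, vector_ranking, k=60):
    """Merge two ranked candidate lists with reciprocal-rank fusion.

    Each document scores 1/(k + rank) per list it appears in; documents
    ranked well by both BM25 and vector search float to the top.
    """
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

In the pipeline described above, the fused list of 24 candidates would then go to the reranker, which keeps the top 6 for prompt injection.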


Part 11: What This Enables

This isn’t infrastructure for its own sake. Here’s what the stack actually does in daily use:

“Turn off the office lights and close the shades” → Home Agent regex fast-path → Home Assistant → lights off in 500ms, then set_cover_position to 0 via RYSE MQTT bridge.

“What containers are running on tower?” → Infrastructure Agent → SSH to tower → docker ps → formatted response with status, IPs, and uptime.

“Create a WireGuard peer for my new laptop” → Infrastructure Agent → OPNsense API → new peer config generated and displayed.

“Review the latest PR on the agent-api repo” → GitHub Agent → GitHub MCP Server → PR diff fetched (131K context window handles large diffs) → detailed review with line-specific comments.

“What did we deploy last week?” → Sparky → Knowledge Graph → temporal query across episodes → list of deployments with dates, IPs, and outcomes.

“Remember that the wlan0 route on tower breaks DGX connectivity after reboot” → Knowledge Graph → stored as Procedure entity → recalled automatically next time DGX connectivity fails.

“Should we use blue-green or rolling deployment for this?” → DevOps → MuninnDB decide → reasons over past deployment memories → recommendation with rationale.

“What did we discuss about the DNS config last week?” → QMD workspace search → finds the session transcript → agent summarizes the relevant conversation with full context.

The three-layer memory system is the force multiplier. Without it, every session starts cold. With it, the agents accumulate operational knowledge that compounds over time — workspace notes via QMD, structured facts via Graphiti, and associative reasoning via MuninnDB. Three months from now, these agents will know the history of every deployment, every workaround, every preference — without anyone maintaining a wiki.


Part 12: The DGX Spark GPU — From Always-On LLMs to On-Demand Compute

The most significant architectural shift since the initial build is how the DGX Spark’s GPU is used. Originally, Qwen3-32B and Qwen2.5-7B consumed 85% of the 128GB unified memory 24/7 as always-on systemd services. After the K3s migration and the Agent-API’s shift to cloud-only providers, the GPU is now entirely free — available on-demand for any workload that needs it.

What Containers Can Use the GPU

The K3s cluster’s GPU Operator exposes 4 time-sliced GPU instances via nvidia.com/gpu resource requests. Any pod that requests GPU gets scheduled:

| Workload | GPU Need | Status |
|---|---|---|
| vLLM Qwen3-32B | 70% memory (~90GB) | K8s deployment at 0 replicas — scale up with one git commit |
| vLLM Qwen2.5-7B | 15% memory (~19GB) | K8s deployment at 0 replicas — available for classification tasks |
| OpenClaw | None (CPU-only Node.js) | Running on K3s — colocated with TEI/QMD/vLLM for cluster-internal latency |
| TEI Embeddings | Optional (CPU currently) | Candidate for GPU acceleration if embedding latency becomes a bottleneck |
| Batch inference | Variable | On-demand fine-tuning, evaluation, or batch processing jobs |
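A parked deployment looks roughly like this — names and image are illustrative, not the actual manifests. Scaling up really is a one-line change to `replicas`, which ArgoCD then syncs:

```yaml
# Illustrative manifest: a vLLM deployment parked at 0 replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-32b
  namespace: vllm
spec:
  replicas: 0          # scale to 1 to load the model on demand
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1   # one time-sliced GPU instance
```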

Why Colocate on K3s?

OpenClaw previously ran on Unraid tower (10.0.3.87) as a Docker container. Its hot-path dependencies — vLLM (when active), QMD search, and TEI embeddings — all run on the DGX Spark K3s cluster (10.0.128.196). Every LLM call, memory search, and embedding lookup was crossing the network twice. Moving OpenClaw to K3s turned these into cluster-internal service calls:

Before (cross-network):
  OpenClaw (10.0.3.87) → TCP proxy → vLLM (10.0.128.196:8000)
  OpenClaw (10.0.3.87) → HTTP → TEI (10.0.3.89:8080)

After (cluster-internal, current):
  OpenClaw pod → vllm.vllm.svc.cluster.local:8000
  OpenClaw pod → tei-embeddings.tei.svc.cluster.local:8080
  OpenClaw pod → qmd-search.qmd.svc.cluster.local:8181

OpenClaw runs as an ArgoCD-managed deployment on K3s — namespace, deployment, service, and external-secret manifests committed to the gitops repo. OpenClaw data stays on tower via NFS mount. The Cloudflare Tunnel ingress points to a MetalLB LoadBalancer IP (10.0.128.203).

Services that remain on tower (Agent-API, Graphiti, MuninnDB, Vault) are reachable from the K3s pod via OPNsense inter-subnet routing. Only the hot-path dependencies benefit from colocation.


Lessons Learned

Cloud LLMs won the Agent-API battle. The original design used Qwen3-32B locally for Home and General agents, with cloud providers as fallbacks. In practice, the local models consumed 85% of the DGX Spark’s GPU memory 24/7 while handling tasks that Groq and OpenRouter serve equally well in under 5 seconds. Removing the local LLM dependency freed the GPU for on-demand workloads, eliminated the 5-minute vLLM startup blocking Agent-API availability, and simplified the router from a model-based classifier to keyword matching. OpenClaw runs entirely on Claude Sonnet 4.6 and Opus 4.6 through the cli-proxy-api translator. The knowledge graph’s entity extraction and MuninnDB’s memory enrichment also use Claude. The takeaway: don’t permanently allocate expensive GPU memory to always-on services when cloud APIs provide equivalent quality for structured tasks at negligible cost.
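The simplified router can be sketched as plain keyword matching — the keyword sets and agent names below are illustrative, not the Agent-API's actual routing table:

```python
# Keyword routing table: first matching agent wins, in insertion order.
ROUTES = {
    "home": ("light", "shade", "thermostat", "scene"),
    "infra": ("container", "docker", "wireguard", "firewall"),
    "github": ("pr", "issue", "repo", "commit"),
}

def route(message: str) -> str:
    """Pick a sub-agent by exact word match against the keyword table."""
    words = message.lower().split()
    for agent, keywords in ROUTES.items():
        if any(kw in words for kw in keywords):
            return agent
    return "general"  # fallback agent
```

Compared with a model-based classifier, this costs nothing per request and never blocks on model startup — the tradeoff the paragraph above describes.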

Macvlan networking is worth the tradeoff. Clean IPs, no NAT, easy debugging. The loss of inter-container firewall rules is acceptable when every service authenticates at the application layer.

MCP servers are the right abstraction. Instead of building custom integrations for every tool, MCP provides a standard interface that any LLM client can discover and use. Adding a new capability means deploying one MCP server, not modifying every agent.

Patching upstream code is sometimes the only option. When the Graphiti image assumes native OpenAI APIs and you’re running Claude through a proxy, you patch. Three bind-mounted Python files is less maintenance than a fork.

Push authentication to the edge. Early iterations used API gateways and OAuth2 proxies for service authentication — each adding containers and complexity. Cloudflare Access with Auth0 SSO replaced all of that by handling authentication at the tunnel edge. Each service gets a per-hostname Access Application managed through Terraform. No gateway containers, no proxy chains, no htpasswd files. Adding auth to a new service is a Terraform resource, not a Dockerfile.
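In Terraform terms, each hostname amounts to roughly one resource like this — a sketch with the attribute set simplified (the Cloudflare provider's Access application resource also supports session duration, policies, and has been renamed in newer provider versions):

```hcl
# Illustrative: one Access application per protected hostname.
resource "cloudflare_access_application" "grafana" {
  zone_id = var.zone_id
  name    = "Grafana"
  domain  = "grafana.example.com"
}
```

Adding auth to a new service is then a copy of this block with a new hostname, not a new container.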

Vault from day one. Every secret in one place with audit logs and short-lived tokens. The initial setup takes an afternoon. The payoff is never wondering where an API key lives or whether it’s been rotated.

Memory needs layers, not one system. A knowledge graph is great for structured facts (“what IP is X on?”) but terrible for searching through session transcripts. A vector search engine finds similar documents but can’t reason over past decisions. MuninnDB’s associative recall captures the intuitive connections that neither structured search nor similarity matching can express. The three layers aren’t redundant — they solve fundamentally different recall problems.

Local GGUF models are viable for retrieval. QMD’s three models (2.1GB total) run on CPU with 1–3 second latency. That’s fast enough for every-message memory injection and cheap enough to run on a NAS. The reranker alone — a 0.6B parameter model — dramatically improves result quality over raw BM25+vector fusion. Running retrieval locally means memory search costs nothing per query, which matters when it runs on every single message.


The Numbers

| Metric | Value |
|---|---|
| Physical hosts | 3 (OPNsense, Unraid, DGX Spark) |
| Docker containers (Unraid) | 20 |
| K8s namespaces (DGX Spark) | 7 |
| K8s pods (DGX Spark) | ~18 (platform services, vLLM at 0 replicas) |
| Local GGUF models | 3 (QMD: embedding 300M, reranker 0.6B, expansion 1.7B) |
| Cloud LLM providers | 3 (Groq, OpenRouter, Anthropic) |
| AI sub-agents | 4 (infra, home, github, general) |
| OpenClaw agents | 3 (Sparky on Sonnet 4.6, Dev on Opus 4.6, DevOps on Opus 4.6) |
| Memory systems | 3 (QMD workspace, Graphiti knowledge graph, MuninnDB cognitive) |
| Sparky skills | 16 workspace + 11 global |
| Dev skills | 9 workspace + 11 global |
| MCP servers | 7 |
| Cloudflare Access applications | 7 (registry, registry-ui, ArgoCD, Grafana, Traefik, DGX, OpenClaw) |
| Caddy reverse proxy entries | 21 |
| Vault secret paths | 15+ |
| Knowledge graph entity types | 8 |
| MuninnDB vaults | 4 (sparky, dev, devops, infra) |
| MuninnDB MCP tools | 33 |
| Total agent tools | 80+ |
| GPU memory allocated | 0 (available on-demand via K3s) |
| QMD GGUF models in RAM | ~2.1GB |

Three hosts. Twenty containers. Eighteen Kubernetes pods. Eighty tools. Three memory systems. GPU on-demand. Zero manual memory management.

The agents remember. The graph grows. The memories compound. The homelab learns.


About the Author

Vitale Mazo is a Senior Cloud Platform Engineer with 19+ years of experience in enterprise IT, specializing in cloud native technologies and multi-cloud infrastructure design.
