Multi-Agent Orchestration on NVIDIA GPU: Architecture for Autonomous AI Fleets
"4 agents, 1 GPU, 0 conflicts. The secret is architecture, not hardware."
Running a single AI agent on a GPU is straightforward. Running four agents that share the same GPU without conflicts, context leakage, or resource contention — that's an architecture problem.
At Ultra Lab, we've been running a 4-agent fleet on a single NVIDIA RTX 3060 Ti for production workloads. This article covers the orchestration architecture: how agents share GPU resources, maintain isolated contexts, schedule tasks without conflicts, and recover from failures automatically.
The Problem: Multi-Agent on Single GPU
When multiple agents share one GPU, you face three challenges:
- Resource contention: Two agents requesting inference simultaneously will either queue (slow) or crash (OOM)
- Context isolation: Agent A's customer data must never leak into Agent B's social media posts
- Scheduling: 105 daily tasks across 4 agents need to execute without collision
Most multi-agent frameworks solve this by giving each agent its own GPU or API endpoint. We don't have that luxury — we have one RTX 3060 Ti with 8GB VRAM. So we engineered around it.
Architecture Overview
┌──────────────────────────────────────────────────┐
│ Scheduling Layer │
│ (25 systemd timers, staggered) │
├──────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ OpenClaw Gateway (:18789) │ │
│ │ Request routing + agent workspace mgmt │ │
│ ├────────────────────────────────────────────┤ │
│ │ │ │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐     │  │
│  │  │ Agent 1  │ │ Agent 2  │ │ Agent 3  │ ... │  │
│  │  │ Context  │ │ Context  │ │ Context  │     │  │
│  │  │(isolated)│ │(isolated)│ │(isolated)│     │  │
│  │  └──────────┘ └──────────┘ └──────────┘     │  │
│ │ │ │
│ ├────────────────────────────────────────────┤ │
│ │ Ollama Server (:11434) │ │
│ │ Single model, sequential inference │ │
│ │ ultralab:7b on RTX 3060 Ti CUDA │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ 62 Scripts (bash + node) │ │
│ │ Data sync, health checks, engage tasks │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ 19 Intelligence Files (.md) │ │
│ │ Pre-computed context injected at runtime │ │
│ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Three layers work together:
- Scheduling: systemd timers stagger tasks across the day
- Gateway: OpenClaw routes requests to the right agent workspace
- Inference: Ollama serves one request at a time on the GPU
Layer 1: Agent Isolation
Each agent has its own workspace — a directory with isolated context files:
~/.openclaw/agents/
├── main/ # UltraLabTW (CEO)
│ ├── IDENTITY.md
│ ├── STRATEGY.md
│ ├── CUSTOMER-INSIGHTS.md
│ ├── POST-PERFORMANCE.md
│ └── ... (19 files)
├── mindthread/ # MindThreadBot
│ ├── IDENTITY.md
│ ├── MINDTHREAD-DATA.md
│ └── ... (subset of files)
├── probe/ # UltraProbeBot
│ ├── IDENTITY.md
│ ├── COMPETITOR-INTEL.md
│ └── ...
└── advisor/ # UltraAdvisor
├── IDENTITY.md
└── ...
The Isolation Principle
Each agent workspace contains only the context files relevant to its role:
- CEO agent: Gets everything — customer insights, strategy, product data, performance metrics
- MindThread agent: Gets MindThread product data + social media performance. No customer insights.
- Probe agent: Gets competitor intel + security research. No customer data.
- Advisor agent: Gets financial advisory context. Minimal cross-agent data.
This isn't just about privacy — it's about token efficiency. Before isolation, all agents loaded the same 19 files (12K tokens of context). After separating contexts, non-CEO agents load 6-8 files (4K tokens). That's a 67% reduction in context size, which directly improves inference speed and quality.
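To make the isolation concrete, here is a minimal sketch of what "the workspace is the context" looks like at inference time. This is not OpenClaw's routing code; the paths follow the workspace layout above, and the request goes to Ollama's standard /api/generate endpoint:

# Sketch: build one agent's prompt from its isolated workspace only
AGENT="mindthread"                                 # workspace under ~/.openclaw/agents/
WORKSPACE="$HOME/.openclaw/agents/$AGENT"
CONTEXT=$(cat "$WORKSPACE"/*.md)                   # only this agent's files; nothing else can leak in
TASK="Draft today's first post."

jq -n --arg model "ultralab:7b" --arg prompt "$CONTEXT"$'\n\n'"$TASK" \
  '{model: $model, prompt: $prompt, stream: false}' \
  | curl -s http://localhost:11434/api/generate -d @- \
  | jq -r '.response'

The gateway does the per-request equivalent; the point is that the prompt can only be built from that agent's directory.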
Layer 2: Task Scheduling
25 systemd timers orchestrate the daily workload. The key insight: stagger everything.
Autopost Schedule (UTC+8)
02:00 UltraLabTW autopost #1
03:00 MindThreadBot autopost #1
04:00 UltraProbeBot autopost #1
─── morning batch done ───
08:00 UltraLabTW autopost #2
09:00 MindThreadBot autopost #2
10:00 UltraProbeBot autopost #2
10:15 UltraLabTW engage
10:30 MindThreadBot engage
10:45 UltraProbeBot engage
─── engagement batch done ───
14:00 UltraLabTW autopost #3
15:00 MindThreadBot autopost #3
...
23:00 UltraLabTW daily-reflect
Why Stagger?
Ollama processes one inference request at a time (NUM_PARALLEL=1). If two agents submit requests simultaneously, one queues. On an 8GB GPU, parallel inference causes OOM crashes.
By staggering timers 1 hour apart for autoposts and 15 minutes apart for engage tasks, we guarantee:
- Maximum GPU idle time between tasks (model stays loaded via KEEP_ALIVE=2h)
- No queue buildup
- Predictable execution order for debugging
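Both knobs are Ollama's standard environment variables. A sketch of pinning them when Ollama runs as a systemd service (our exact setup differs; systemctl edit opens a drop-in override for the ollama unit):

# Keep inference strictly sequential and the model resident between staggered tasks
sudo systemctl edit ollama       # in the drop-in that opens, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
#   Environment="OLLAMA_KEEP_ALIVE=2h"
sudo systemctl daemon-reload && sudo systemctl restart ollama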
Timer Reliability
systemd timers are more reliable than cron for this workload:
[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=120
- Persistent=true: If the machine was off during the scheduled time, run immediately on boot
- RandomizedDelaySec=120: Add 0-2 minutes of jitter to avoid a thundering herd
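A timer activates the service unit of the same name, so each scheduled task is a timer plus a one-shot service. A sketch of one staggered autopost slot (unit names and the script path are illustrative; [Unit] and [Install] sections omitted):

# ultralab-main-autopost.timer
[Timer]
OnCalendar=*-*-* 08:00:00
Persistent=true
RandomizedDelaySec=120

# ultralab-main-autopost.service
[Service]
Type=oneshot
ExecStart=/home/ultralab/scripts/autopost.sh main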
Layer 3: Intelligence Pipeline
The 19 intelligence files are the secret weapon. They provide pre-computed context that costs zero LLM tokens to generate:
Data Flow
External Sources          Scripts (0 LLM cost)           Agent Workspace
─────────────────    →    ───────────────────      →     ──────────────
Firestore inquiries       sync-customer-insights         CUSTOMER-INSIGHTS.md
MindThread Firebase       sync-mindthread-data           MINDTHREAD-DATA.md
Moltbook API              collect-platform-data          platform-intel.md
HN / RSS feeds            blogwatcher + hn-trending      RESEARCH-NOTES.md
Git commit history        dev-to-social                  recent-commits.md
Each script runs on a systemd timer, fetches data from external sources, and writes structured Markdown files. When an agent runs, its workspace files are injected as context — the LLM reads current, real data without any API calls.
Cost: $0
This is critical. The intelligence pipeline runs entirely on bash scripts, Node.js API calls, and file I/O. No LLM inference needed. The agents get rich context for free.
Example: Customer Insights Pipeline
// sync-customer-insights.js (runs daily at 06:00, ES module)
import admin from 'firebase-admin'
import fs from 'node:fs'

admin.initializeApp()
const db = admin.firestore()
const sevenDaysAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)

// 1. Query Firestore for recent inquiries
const inquiries = await db.collection('inquiries')
  .where('createdAt', '>=', sevenDaysAgo)
  .orderBy('createdAt', 'desc')
  .get()

// 2. Format as structured Markdown (client info, status, follow-up dates; formatting simplified here)
const formatted = inquiries.docs.map(doc => `- ${JSON.stringify(doc.data())}`).join('\n')

// 3. Write into the CEO agent's workspace
fs.writeFileSync('CUSTOMER-INSIGHTS.md', formatted)
The CEO agent reads this file and makes strategic decisions based on real customer data — without spending a single token on data retrieval.
Failure Recovery
With 105 daily tasks, things will break. Our recovery architecture:
Level 1: Ollama Health Check (every 10 min)
curl -sf http://localhost:11434/api/tags > /dev/null || {
  systemctl restart ollama
  sleep 10   # wait for model reload
}
Ollama occasionally hangs after ~72 hours. Auto-restart + model reload takes ~8 seconds. No human intervention.
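The check itself is just another timer-driven unit. A sketch of the 10-minute schedule (unit name illustrative):

# ollama-healthcheck.timer
[Timer]
OnCalendar=*-*-* *:00/10:00
Persistent=true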
Level 2: Gateway Watchdog (every 2 min)
The OpenClaw gateway has its own health check. If it crashes, systemd restarts it automatically via Restart=always.
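Nothing exotic is required for that; the relevant part of the gateway's unit looks roughly like this (unit name and restart delay illustrative):

# openclaw-gateway.service
[Service]
Restart=always
RestartSec=5        # wait 5 s before bringing the gateway back up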
Level 3: Task-Level Error Handling
Each cron job has delivery.mode: "failure-alert" — if a task fails, it sends a notification to Discord. If a task succeeds, silence. This means no notification = everything is working.
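Our tasks get this behaviour from OpenClaw's delivery.mode setting, but the pattern is easy to reproduce with a plain wrapper. A sketch, assuming a standard Discord webhook URL in $DISCORD_WEBHOOK_URL:

#!/usr/bin/env bash
# notify-on-failure.sh <task-name> <command...>: run the task, ping Discord only if it fails
TASK_NAME="$1"; shift
if ! "$@"; then
  curl -s -H "Content-Type: application/json" \
    -d "{\"content\": \"$TASK_NAME failed on $(hostname)\"}" \
    "$DISCORD_WEBHOOK_URL" > /dev/null
fi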
Level 4: Rate Limit Detection
All engage scripts detect API rate limits and skip gracefully instead of posting error messages:
if echo "$response" | grep -q "RATE_LIMIT\|429\|quota"; then
  echo "Rate limited, skipping"
  exit 0   # exit clean, don't trigger failure alert
fi
Scaling Patterns
Pattern 1: Add More Agents (Same GPU)
Adding a 5th agent doesn't require more GPU. It requires:
- A new workspace directory with role-specific context files
- New systemd timers staggered into existing schedule gaps
- A new agent config in OpenClaw
GPU utilization stays the same — tasks are sequential, and a single 7B model handles all agents.
Practical limit: ~8 agents on current schedule (24 hours / 3 tasks per agent per day = ~8 agents with comfortable spacing).
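In practice the new agent is a few filesystem and systemd operations plus a new agent entry in the OpenClaw config. A sketch with illustrative names:

AGENT=researcher                                   # hypothetical 5th role
mkdir -p ~/.openclaw/agents/$AGENT
$EDITOR ~/.openclaw/agents/$AGENT/IDENTITY.md      # write the role definition
# copy in only the intelligence files this role needs, register the agent in
# OpenClaw (not shown), then schedule it into a free slot:
sudo systemctl enable --now ultralab-$AGENT-autopost.timer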
Pattern 2: Add More GPU (Same Agents)
Adding a second RTX card enables:
- NUM_PARALLEL=2: two simultaneous inference streams
- No staggering needed — agents can run in parallel
- Or: run a larger model (14B) on the primary GPU while the secondary handles overflow
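For the split-workload option, one approach is a second Ollama instance pinned to the second card on its own port (CUDA_VISIBLE_DEVICES and OLLAMA_HOST are standard settings; the port is illustrative):

# GPU 0 keeps the existing instance on :11434; GPU 1 serves overflow or a larger model
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &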
Pattern 3: Hybrid Local + Cloud
Our current approach: 95% of tasks run on local GPU, 5% (complex analysis) goes to cloud API. This scales naturally — as the local workload grows, add GPU capacity for routine tasks while keeping cloud APIs for frontier reasoning.
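The routing decision itself stays trivial. A sketch (task names and the cloud endpoint variable are illustrative, not our production router):

case "$TASK_TYPE" in
  autopost|engage|reflect) ENDPOINT="http://localhost:11434/api/generate" ;;  # routine work on the local 7B
  deep-analysis|strategy)  ENDPOINT="$CLOUD_API_URL" ;;                       # frontier reasoning in the cloud
esac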
What We Learned
1. Context Isolation > Model Size
We got better results from a 7B model with clean, isolated context than from a larger model with noisy, shared context. Agent quality is proportional to context quality, not model size.
2. Pre-computed Context is Free Intelligence
The intelligence pipeline (19 .md files) gives agents real-time awareness for $0. This is the highest-ROI investment in our architecture.
3. Sequential is Fine for Agents
Agents don't need real-time inference. A social media post can wait 30 seconds in a queue. Sequential processing on a single GPU is perfectly adequate for autonomous agent workloads.
4. systemd > Everything
We tried cron, PM2, and custom schedulers. systemd timers with Persistent=true and automatic restart are the most reliable scheduling system we've used. Zero missed tasks in 30 days.
5. Silence is the Best Alert
Configure notifications for failures only. If you get no alerts, everything is working. This scales to any number of agents without alert fatigue.
Getting Started
If you want to build a multi-agent fleet on a single NVIDIA GPU:
- Start with 1 agent — get Ollama + OpenClaw running with a single cron job (a minimal sketch follows this list)
- Add intelligence files — pre-computed context gives the biggest quality boost
- Add agent 2 — separate workspace, staggered timer, role-specific context
- Monitor for a week — check GPU utilization, task completion, failure rate
- Scale carefully — each new agent adds complexity; keep contexts isolated
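A minimal starting point for step 1, assuming a stock 7B model rather than our custom ultralab:7b and an illustrative script path:

ollama pull qwen2.5:7b                         # any 7B model that fits in 8GB VRAM
crontab -e
# 0 8 * * * /home/you/agents/main/autopost.sh  # one scheduled task; move to systemd timers later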
Our complete architecture is documented in the open-source repo:
- GitHub: free-tier-agent-fleet
- Agent Fleet Dashboard: ultralab.tw/agent
Ultra Lab builds AI products. Our 4-agent fleet runs autonomously on NVIDIA GPU-accelerated local inference. Learn more at ultralab.tw.