Multi-Agent Orchestration on NVIDIA GPU: Architecture for Autonomous AI Fleets
"4 agents, 1 GPU, 0 conflicts. The secret is architecture, not hardware."
Running a single AI agent on a GPU is straightforward. Running four agents that share the same GPU without conflicts, context leakage, or resource contention — that's an architecture problem.
At Ultra Lab, we've been running a 4-agent fleet on a single NVIDIA RTX 3060 Ti for production workloads. This article covers the orchestration architecture: how agents share GPU resources, maintain isolated contexts, schedule tasks without conflicts, and recover from failures automatically.
The Problem: Multi-Agent on Single GPU
When multiple agents share one GPU, you face three challenges:
- Resource contention: Two agents requesting inference simultaneously will either queue (slow) or crash (OOM)
- Context isolation: Agent A's customer data must never leak into Agent B's social media posts
- Scheduling: 105 daily tasks across 4 agents need to execute without collision
Most multi-agent frameworks solve this by giving each agent its own GPU or API endpoint. We don't have that luxury — we have one RTX 3060 Ti with 8GB VRAM. So we engineered around it.
Architecture Overview
┌──────────────────────────────────────────────────┐
│ Scheduling Layer │
│ (25 systemd timers, staggered) │
├──────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ OpenClaw Gateway (:18789) │ │
│ │ Request routing + agent workspace mgmt │ │
│ ├────────────────────────────────────────────┤ │
│ │ │ │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐     │  │
│  │  │ Agent 1  │ │ Agent 2  │ │ Agent 3  │ ... │  │
│  │  │ Context  │ │ Context  │ │ Context  │     │  │
│  │  │(isolated)│ │(isolated)│ │(isolated)│     │  │
│  │  └──────────┘ └──────────┘ └──────────┘     │  │
│ │ │ │
│ ├────────────────────────────────────────────┤ │
│ │ Ollama Server (:11434) │ │
│ │ Single model, sequential inference │ │
│ │ ultralab:7b on RTX 3060 Ti CUDA │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ 62 Scripts (bash + node) │ │
│ │ Data sync, health checks, engage tasks │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ 19 Intelligence Files (.md) │ │
│ │ Pre-computed context injected at runtime │ │
│ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Three layers work together:
- Scheduling: systemd timers stagger tasks across the day
- Gateway: OpenClaw routes requests to the right agent workspace
- Inference: Ollama serves one request at a time on the GPU
Layer 1: Agent Isolation
Each agent has its own workspace — a directory with isolated context files:
~/.openclaw/agents/
├── main/ # UltraLabTW (CEO)
│ ├── IDENTITY.md
│ ├── STRATEGY.md
│ ├── CUSTOMER-INSIGHTS.md
│ ├── POST-PERFORMANCE.md
│ └── ... (19 files)
├── mindthread/ # MindThreadBot
│ ├── IDENTITY.md
│ ├── MINDTHREAD-DATA.md
│ └── ... (subset of files)
├── probe/ # UltraProbeBot
│ ├── IDENTITY.md
│ ├── COMPETITOR-INTEL.md
│ └── ...
└── advisor/ # UltraAdvisor
├── IDENTITY.md
└── ...
The Isolation Principle
Each agent workspace contains only the context files relevant to its role:
- CEO agent: Gets everything — customer insights, strategy, product data, performance metrics
- MindThread agent: Gets MindThread product data + social media performance. No customer insights.
- Probe agent: Gets competitor intel + security research. No customer data.
- Advisor agent: Gets financial advisory context. Minimal cross-agent data.
This isn't just about privacy — it's about token efficiency. Before isolation, all agents loaded the same 19 files (12K tokens of context). After separating contexts, non-CEO agents load 6-8 files (4K tokens). That's a 67% reduction in context size, which directly improves inference speed and quality.
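To make the isolation concrete, here is a minimal sketch of what "the workspace is the context" looks like at inference time. This is not OpenClaw's routing code; the paths follow the workspace layout above, and the request goes to Ollama's standard /api/generate endpoint:

# Sketch: build one agent's prompt from its isolated workspace only
AGENT="mindthread"                                 # workspace under ~/.openclaw/agents/
WORKSPACE="$HOME/.openclaw/agents/$AGENT"
CONTEXT=$(cat "$WORKSPACE"/*.md)                   # only this agent's files; nothing else can leak in
TASK="Draft today's first post."

jq -n --arg model "ultralab:7b" --arg prompt "$CONTEXT"$'\n\n'"$TASK" \
  '{model: $model, prompt: $prompt, stream: false}' \
  | curl -s http://localhost:11434/api/generate -d @- \
  | jq -r '.response'

The gateway does the per-request equivalent; the point is that the prompt can only be built from that agent's directory.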
Layer 2: Task Scheduling
25 systemd timers orchestrate the daily workload. The key insight: stagger everything.
Autopost Schedule (UTC+8)
02:00 UltraLabTW autopost #1
03:00 MindThreadBot autopost #1
04:00 UltraProbeBot autopost #1
─── morning batch done ───
08:00 UltraLabTW autopost #2
09:00 MindThreadBot autopost #2
10:00 UltraProbeBot autopost #2
10:15 UltraLabTW engage
10:30 MindThreadBot engage
10:45 UltraProbeBot engage
─── engagement batch done ───
14:00 UltraLabTW autopost #3
15:00 MindThreadBot autopost #3
...
23:00 UltraLabTW daily-reflect
Why Stagger?
Ollama processes one inference request at a time (NUM_PARALLEL=1). If two agents submit requests simultaneously, one queues. On an 8GB GPU, parallel inference causes OOM crashes.
By staggering timers 1 hour apart for autoposts and 15 minutes apart for engage tasks, we guarantee:
- Maximum GPU idle time between tasks (model stays loaded via KEEP_ALIVE=2h)
- No queue buildup
- Predictable execution order for debugging
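Both knobs are Ollama's standard environment variables. A sketch of pinning them when Ollama runs as a systemd service (our exact setup differs; systemctl edit opens a drop-in override for the ollama unit):

# Keep inference strictly sequential and the model resident between staggered tasks
sudo systemctl edit ollama       # in the drop-in that opens, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
#   Environment="OLLAMA_KEEP_ALIVE=2h"
sudo systemctl daemon-reload && sudo systemctl restart ollama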
Timer Reliability
systemd timers are more reliable than cron for this workload:
[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=120
- Persistent=true: If the machine was off during the scheduled time, run immediately on boot
- RandomizedDelaySec=120: Add 0-2 minutes of jitter to avoid a thundering herd
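A timer activates the service unit of the same name, so each scheduled task is a timer plus a one-shot service. A sketch of one staggered autopost slot (unit names and the script path are illustrative; [Unit] and [Install] sections omitted):

# ultralab-main-autopost.timer
[Timer]
OnCalendar=*-*-* 08:00:00
Persistent=true
RandomizedDelaySec=120

# ultralab-main-autopost.service
[Service]
Type=oneshot
ExecStart=/home/ultralab/scripts/autopost.sh main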
Layer 3: Intelligence Pipeline
The 19 intelligence files are the secret weapon. They provide pre-computed context that costs zero LLM tokens to generate:
Data Flow
External Sources          Scripts (0 LLM cost)           Agent Workspace
─────────────────    →    ───────────────────      →     ──────────────
Firestore inquiries       sync-customer-insights         CUSTOMER-INSIGHTS.md
MindThread Firebase       sync-mindthread-data           MINDTHREAD-DATA.md
Moltbook API              collect-platform-data          platform-intel.md
HN / RSS feeds            blogwatcher + hn-trending      RESEARCH-NOTES.md
Git commit history        dev-to-social                  recent-commits.md
Each script runs on a systemd timer, fetches data from external sources, and writes structured Markdown files. When an agent runs, its workspace files are injected as context — the LLM reads current, real data without any API calls.
Cost: $0
This is critical. The intelligence pipeline runs entirely on bash scripts, Node.js API calls, and file I/O. No LLM inference needed. The agents get rich context for free.
Example: Customer Insights Pipeline
// sync-customer-insights.js (runs daily at 06:00, ES module)
import admin from 'firebase-admin'
import fs from 'node:fs'

admin.initializeApp()
const db = admin.firestore()
const sevenDaysAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)

// 1. Query Firestore for recent inquiries
const inquiries = await db.collection('inquiries')
  .where('createdAt', '>=', sevenDaysAgo)
  .orderBy('createdAt', 'desc')
  .get()

// 2. Format as structured Markdown (client info, status, follow-up dates; formatting simplified here)
const formatted = inquiries.docs.map(doc => `- ${JSON.stringify(doc.data())}`).join('\n')

// 3. Write into the CEO agent's workspace
fs.writeFileSync('CUSTOMER-INSIGHTS.md', formatted)
The CEO agent reads this file and makes strategic decisions based on real customer data — without spending a single token on data retrieval.
Failure Recovery
With 105 daily tasks, things will break. Our recovery architecture:
Level 1: Ollama Health Check (every 10 min)
curl -sf http://localhost:11434/api/tags > /dev/null || {
  systemctl restart ollama
  sleep 10   # wait for model reload
}
Ollama occasionally hangs after ~72 hours. Auto-restart + model reload takes ~8 seconds. No human intervention.
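The check itself is just another timer-driven unit. A sketch of the 10-minute schedule (unit name illustrative):

# ollama-healthcheck.timer
[Timer]
OnCalendar=*-*-* *:00/10:00
Persistent=true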
Level 2: Gateway Watchdog (every 2 min)
The OpenClaw gateway has its own health check. If it crashes, systemd restarts it automatically via Restart=always.
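Nothing exotic is required for that; the relevant part of the gateway's unit looks roughly like this (unit name and restart delay illustrative):

# openclaw-gateway.service
[Service]
Restart=always
RestartSec=5        # wait 5 s before bringing the gateway back up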
Level 3: Task-Level Error Handling
Each cron job has delivery.mode: "failure-alert" — if a task fails, it sends a notification to Discord. If a task succeeds, silence. This means no notification = everything is working.
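Our tasks get this behaviour from OpenClaw's delivery.mode setting, but the pattern is easy to reproduce with a plain wrapper. A sketch, assuming a standard Discord webhook URL in $DISCORD_WEBHOOK_URL:

#!/usr/bin/env bash
# notify-on-failure.sh <task-name> <command...>: run the task, ping Discord only if it fails
TASK_NAME="$1"; shift
if ! "$@"; then
  curl -s -H "Content-Type: application/json" \
    -d "{\"content\": \"$TASK_NAME failed on $(hostname)\"}" \
    "$DISCORD_WEBHOOK_URL" > /dev/null
fi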
Level 4: Rate Limit Detection
All engage scripts detect API rate limits and skip gracefully instead of posting error messages:
if echo "$response" | grep -q "RATE_LIMIT\|429\|quota"; then
  echo "Rate limited, skipping"
  exit 0   # exit clean, don't trigger failure alert
fi
Scaling Patterns
Pattern 1: Add More Agents (Same GPU)
Adding a 5th agent doesn't require more GPU. It requires:
- A new workspace directory with role-specific context files
- New systemd timers staggered into existing schedule gaps
- A new agent config in OpenClaw
GPU utilization stays the same — tasks are sequential, and a single 7B model handles all agents.
Practical limit: ~8 agents on current schedule (24 hours / 3 tasks per agent per day = ~8 agents with comfortable spacing).
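In practice the new agent is a few filesystem and systemd operations plus a new agent entry in the OpenClaw config. A sketch with illustrative names:

AGENT=researcher                                   # hypothetical 5th role
mkdir -p ~/.openclaw/agents/$AGENT
$EDITOR ~/.openclaw/agents/$AGENT/IDENTITY.md      # write the role definition
# copy in only the intelligence files this role needs, register the agent in
# OpenClaw (not shown), then schedule it into a free slot:
sudo systemctl enable --now ultralab-$AGENT-autopost.timer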
Pattern 2: Add More GPU (Same Agents)
Adding a second RTX card enables:
- NUM_PARALLEL=2: two simultaneous inference streams
- No staggering needed — agents can run in parallel
- Or: run a larger model (14B) on the primary GPU while the secondary handles overflow
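For the split-workload option, one approach is a second Ollama instance pinned to the second card on its own port (CUDA_VISIBLE_DEVICES and OLLAMA_HOST are standard settings; the port is illustrative):

# GPU 0 keeps the existing instance on :11434; GPU 1 serves overflow or a larger model
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &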
Pattern 3: Hybrid Local + Cloud
Our current approach: 95% of tasks run on local GPU, 5% (complex analysis) goes to cloud API. This scales naturally — as the local workload grows, add GPU capacity for routine tasks while keeping cloud APIs for frontier reasoning.
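The routing decision itself stays trivial. A sketch (task names and the cloud endpoint variable are illustrative, not our production router):

case "$TASK_TYPE" in
  autopost|engage|reflect) ENDPOINT="http://localhost:11434/api/generate" ;;  # routine work on the local 7B
  deep-analysis|strategy)  ENDPOINT="$CLOUD_API_URL" ;;                       # frontier reasoning in the cloud
esac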
What We Learned
1. Context Isolation > Model Size
We got better results from a 7B model with clean, isolated context than from a larger model with noisy, shared context. Agent quality is proportional to context quality, not model size.
2. Pre-computed Context is Free Intelligence
The intelligence pipeline (19 .md files) gives agents real-time awareness for $0. This is the highest-ROI investment in our architecture.
3. Sequential is Fine for Agents
Agents don't need real-time inference. A social media post can wait 30 seconds in a queue. Sequential processing on a single GPU is perfectly adequate for autonomous agent workloads.
4. systemd > Everything
We tried cron, PM2, and custom schedulers. systemd timers with Persistent=true and automatic restart are the most reliable scheduling system we've used. Zero missed tasks in 30 days.
5. Silence is the Best Alert
Configure notifications for failures only. If you get no alerts, everything is working. This scales to any number of agents without alert fatigue.
Getting Started
If you want to build a multi-agent fleet on a single NVIDIA GPU:
- Start with 1 agent — get Ollama + OpenClaw running with a single cron job (a minimal sketch follows this list)
- Add intelligence files — pre-computed context gives the biggest quality boost
- Add agent 2 — separate workspace, staggered timer, role-specific context
- Monitor for a week — check GPU utilization, task completion, failure rate
- Scale carefully — each new agent adds complexity; keep contexts isolated
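A minimal starting point for step 1, assuming a stock 7B model rather than our custom ultralab:7b and an illustrative script path:

ollama pull qwen2.5:7b                         # any 7B model that fits in 8GB VRAM
crontab -e
# 0 8 * * * /home/you/agents/main/autopost.sh  # one scheduled task; move to systemd timers later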
Our complete architecture is documented in the open-source repo:
- GitHub: free-tier-agent-fleet
- Agent Fleet Dashboard: ultralab.tw/agent
Ultra Lab builds AI products. Our 4-agent fleet runs autonomously on NVIDIA GPU-accelerated local inference. Learn more at ultralab.tw.