Self-Healing Networks
Automatic agent rerouting when health degrades – Directive-native self-healing.
When an agent becomes unhealthy (circuit breaker opens, health score drops), the system reactively reroutes work to a healthy equivalent. Zero overhead when not configured.
Quick Start
Single-Agent Fallback
When the primary circuit breaker opens, fall back to alternate runners:
import { createAgentOrchestrator } from '@directive-run/ai';
const orchestrator = createAgentOrchestrator({
runner: primaryRunner,
circuitBreaker: { failureThreshold: 3, resetTimeoutMs: 30000 },
selfHealing: {
fallbackRunners: [backupRunner, emergencyRunner],
onReroute: (event) => console.log(`Rerouted to ${event.reroutedTo}`),
},
});
Fallback runners are tried in order. When the primary circuit recovers (half-open → closed), the primary runner automatically resumes handling requests.
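The recovery path described above follows the usual circuit breaker state machine. Here is a minimal, library-independent sketch. It assumes that `failureThreshold` counts failures since the last success and that `resetTimeoutMs` is the wait before a trial request is allowed. `createSketchBreaker` and its methods are hypothetical names used for illustration and are not part of `@directive-run/ai`.

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Hypothetical illustration of the generic circuit breaker pattern.
// Time is passed in explicitly (ms) so the behavior is easy to follow.
function createSketchBreaker(failureThreshold: number, resetTimeoutMs: number) {
  let state: CircuitState = 'CLOSED';
  let failures = 0;
  let openedAt = 0;

  // An OPEN circuit becomes HALF_OPEN once the reset timeout has elapsed.
  function getState(now: number): CircuitState {
    if (state === 'OPEN' && now - openedAt >= resetTimeoutMs) {
      state = 'HALF_OPEN';
    }
    return state;
  }

  // A success while CLOSED or HALF_OPEN closes the circuit and clears the count.
  function recordSuccess(now: number): void {
    if (getState(now) === 'OPEN') return;
    state = 'CLOSED';
    failures = 0;
  }

  // A failure in HALF_OPEN reopens immediately; in CLOSED it opens at the threshold.
  function recordFailure(now: number): void {
    const current = getState(now);
    if (current === 'OPEN') return;
    failures += 1;
    if (current === 'HALF_OPEN' || failures >= failureThreshold) {
      state = 'OPEN';
      openedAt = now;
      failures = 0;
    }
  }

  return { getState, recordSuccess, recordFailure };
}
```

With `failureThreshold: 3` and `resetTimeoutMs: 30000`, three failures open the circuit, a trial request is allowed 30 seconds later, and a successful trial closes it again.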
Multi-Agent Rerouting
When an agent's circuit breaker opens, reroute to an equivalent agent:
import { createMultiAgentOrchestrator } from '@directive-run/ai';
const orchestrator = createMultiAgentOrchestrator({
runner,
agents: {
'gpt-writer': { agent: gptWriter, capabilities: ['writing'] },
'claude-writer': { agent: claudeWriter, capabilities: ['writing'] },
'researcher': { agent: researcher, capabilities: ['research'] },
},
selfHealing: {
circuitBreakerDefaults: { failureThreshold: 3, resetTimeoutMs: 30000 },
useCapabilities: true,
selectionStrategy: 'healthiest',
onReroute: (event) => console.log(`${event.originalAgent} → ${event.reroutedTo}`),
},
});
When gpt-writer fails, the system automatically reroutes to claude-writer (same writing capability).
Health Monitor
The health monitor tracks per-agent metrics in a rolling time window:
import { createHealthMonitor } from '@directive-run/ai';
const monitor = createHealthMonitor({
windowMs: 60000, // 60-second rolling window
maxNormalLatencyMs: 5000, // latency normalization ceiling
maxEventsPerAgent: 1000, // max events per agent before FIFO eviction
});
monitor.recordSuccess('agent-a', 120);
monitor.recordFailure('agent-a', 5000, new Error('timeout'));
const score = monitor.getHealthScore('agent-a'); // 0-100
const metrics = monitor.getMetrics('agent-a');
// { agentId, circuitState, successRate, avgLatencyMs, healthScore, lastErrors, ... }
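To illustrate what a rolling window with FIFO eviction means in practice, here is a small standalone sketch. It is not the library's implementation: `createRollingWindow`, `SketchEvent`, and the decision to include failed calls in the latency average are assumptions made for the example.

```typescript
interface SketchEvent {
  at: number;        // timestamp in ms
  ok: boolean;       // true for success, false for failure
  latencyMs: number;
}

// Hypothetical helper: keeps only events inside the time window,
// capped at maxEvents with the oldest event dropped first.
function createRollingWindow(windowMs: number, maxEvents: number) {
  const events: SketchEvent[] = [];

  // Drop events that have aged out of the window.
  function prune(now: number): void {
    while (events.length > 0 && now - events[0]!.at > windowMs) {
      events.shift();
    }
  }

  return {
    record(event: SketchEvent): void {
      events.push(event);
      if (events.length > maxEvents) events.shift(); // FIFO eviction
    },
    summarize(now: number) {
      prune(now);
      const total = events.length;
      const successes = events.filter((e) => e.ok).length;
      const latencySum = events.reduce((sum, e) => sum + e.latencyMs, 0);
      return {
        total,
        successRate: total === 0 ? null : successes / total,
        avgLatencyMs: total === 0 ? null : latencySum / total,
      };
    },
  };
}
```

Metrics computed this way reflect only recent behavior, so an agent that recovers stops being penalized once its old failures age out of the window.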
Health Score Formula
The score is a weighted combination (all configurable):
| Factor | Default Weight | How it's computed |
|---|---|---|
| Success rate | 0.5 | successes / totalEvents |
| Latency | 0.3 | 1 - (avgLatency / maxNormalLatencyMs) |
| Circuit state | 0.2 | CLOSED=1, HALF_OPEN=0.5, OPEN=0 |
An agent with no recorded events gets a neutral score of 50. Scores range from 0 to 100.
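The table can be read as the following calculation. This is a sketch under stated assumptions rather than the library's code: it assumes the latency factor is clamped to the range 0 to 1, that the weighted sum is scaled by 100, and that the result is rounded. `sketchHealthScore` and `ScoreInput` are hypothetical names.

```typescript
type CircuitState = 'CLOSED' | 'HALF_OPEN' | 'OPEN';

interface ScoreInput {
  successes: number;
  totalEvents: number;
  avgLatencyMs: number;
  circuitState: CircuitState;
}

// Default weights from the table above.
const DEFAULT_WEIGHTS = { successRate: 0.5, latency: 0.3, circuitState: 0.2 };
const CIRCUIT_FACTOR: Record<CircuitState, number> = { CLOSED: 1, HALF_OPEN: 0.5, OPEN: 0 };

function sketchHealthScore(
  input: ScoreInput,
  maxNormalLatencyMs: number = 5000,
  weights: typeof DEFAULT_WEIGHTS = DEFAULT_WEIGHTS,
): number {
  // No data: neutral score.
  if (input.totalEvents === 0) return 50;

  const successFactor = input.successes / input.totalEvents;
  // Assumption: latency above the ceiling counts as 0 rather than going negative.
  const latencyFactor = Math.min(1, Math.max(0, 1 - input.avgLatencyMs / maxNormalLatencyMs));
  const circuitFactor = CIRCUIT_FACTOR[input.circuitState];

  const weighted =
    weights.successRate * successFactor +
    weights.latency * latencyFactor +
    weights.circuitState * circuitFactor;

  // Assumption: scale the 0-1 weighted sum to 0-100 and round.
  return Math.round(weighted * 100);
}
```

For example, 9 successes out of 10 events with a 1000 ms average latency and a closed circuit gives 0.5 × 0.9 + 0.3 × 0.8 + 0.2 × 1 = 0.89, which this sketch reports as 89.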
Equivalency Resolution
The system finds equivalent agents through two mechanisms:
Capability Matching (default)
Agents with overlapping capabilities arrays are considered equivalent:
agents: {
'fast-writer': { agent: fastWriter, capabilities: ['writing', 'summarization'] },
'deep-writer': { agent: deepWriter, capabilities: ['writing', 'analysis'] },
'researcher': { agent: researcher, capabilities: ['research'] },
},
selfHealing: { useCapabilities: true },
When fast-writer fails, deep-writer is a candidate (shared writing capability). researcher is not (no overlap).
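The overlap rule can be sketched as a simple set intersection check. The function below is illustrative only: `findEquivalentsByCapability` and the `CapabilityMap` shape are hypothetical and just mirror the `capabilities` arrays from the configuration above.

```typescript
// Maps agent id -> its capabilities array.
type CapabilityMap = Record<string, string[]>;

// Returns every other agent that shares at least one capability with the failed agent.
function findEquivalentsByCapability(agents: CapabilityMap, failedId: string): string[] {
  const failedCaps = new Set(agents[failedId] ?? []);
  return Object.keys(agents).filter(
    (id) => id !== failedId && (agents[id] ?? []).some((cap) => failedCaps.has(cap)),
  );
}

const exampleAgents: CapabilityMap = {
  'fast-writer': ['writing', 'summarization'],
  'deep-writer': ['writing', 'analysis'],
  researcher: ['research'],
};

// 'deep-writer' shares 'writing' with 'fast-writer'; 'researcher' shares nothing.
const writerCandidates = findEquivalentsByCapability(exampleAgents, 'fast-writer');
```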
Explicit Groups
For fine-grained control, define equivalency groups:
selfHealing: {
equivalencyGroups: {
writers: ['fast-writer', 'deep-writer', 'backup-writer'],
researchers: ['researcher', 'backup-researcher'],
},
},
Explicit groups are checked first, then capability matching. Unhealthy agents are filtered out, and the configured selection strategy (healthiest by default) picks among the remaining equivalents.
Selection Strategy
| Strategy | Behavior |
|---|---|
| healthiest (default) | Pick the equivalent with the highest health score |
| round-robin | Rotate through equivalents evenly |
selfHealing: {
selectionStrategy: 'round-robin',
},
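The difference between the two strategies can be sketched as follows. This is an illustration, not the library's selector: `createSelector` is a hypothetical name, and two details are assumptions, namely that ties go to the earlier candidate and that an agent without a score is treated as neutral (50).

```typescript
type Strategy = 'healthiest' | 'round-robin';

// Hypothetical selector factory; the cursor persists between calls for round-robin.
function createSelector(strategy: Strategy) {
  let cursor = 0;

  return function select(
    candidates: string[],
    scores: Record<string, number>,
  ): string | undefined {
    if (candidates.length === 0) return undefined;

    if (strategy === 'round-robin') {
      // Rotate through candidates evenly, ignoring scores.
      const choice = candidates[cursor % candidates.length];
      cursor += 1;
      return choice;
    }

    // healthiest: highest score wins; ties keep the earlier candidate.
    return candidates.reduce((best, id) =>
      (scores[id] ?? 50) > (scores[best] ?? 50) ? id : best,
    );
  };
}
```

Choosing `healthiest` concentrates traffic on the best-scoring equivalent, while `round-robin` spreads it across all healthy equivalents regardless of score.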
Degradation Policies
When all equivalents are exhausted:
| Policy | Behavior |
|---|---|
| reject (default) | Throw the original error |
| fallback-response | Return a static response |
selfHealing: {
degradation: 'fallback-response',
fallbackResponse: { output: 'Service temporarily unavailable.' },
},
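Conceptually, the two policies differ only in what happens after the last equivalent has failed. A minimal sketch of that decision, with `applyDegradation` as a hypothetical helper name:

```typescript
type Degradation = 'reject' | 'fallback-response';

// Hypothetical helper: called once every equivalent has been exhausted.
function applyDegradation(
  policy: Degradation,
  originalError: Error,
  fallbackResponse?: unknown,
): unknown {
  if (policy === 'fallback-response') {
    // Return the configured static response instead of failing.
    return fallbackResponse;
  }
  // 'reject': surface the original error to the caller unchanged.
  throw originalError;
}
```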
Single-Agent Configuration
interface SelfHealingConfig {
fallbackRunners?: AgentRunner[]; // tried in order
fallbackAgent?: AgentLike; // alternate agent definition
circuitBreaker?: CircuitBreakerConfig;
healthThreshold?: number; // default 30
degradation?: 'reject' | 'fallback-response';
fallbackResponse?: unknown;
onReroute?: (event: RerouteEvent) => void;
}
Circuit breaker required
selfHealing requires a circuitBreaker to detect failures. If you configure selfHealing without circuitBreaker, a dev-mode warning is emitted.
Multi-Agent Configuration
interface MultiAgentSelfHealingConfig {
circuitBreakerDefaults?: CircuitBreakerConfig;
healthThreshold?: number; // default 30
equivalencyGroups?: Record<string, string[]>;
useCapabilities?: boolean; // default true
selectionStrategy?: 'healthiest' | 'round-robin';
degradation?: 'reject' | 'fallback-response';
fallbackResponse?: unknown;
onReroute?: (event: RerouteEvent) => void;
healthMonitor?: HealthMonitorConfig;
}
Lifecycle Hooks
selfHealing: {
onReroute: (event) => {
// event.originalAgent – the unhealthy agent
// event.reroutedTo – the replacement
// event.reason – why rerouting happened
console.log(`Rerouted ${event.originalAgent} → ${event.reroutedTo}: ${event.reason}`);
},
},
Circular Reroute Guard
A rerouted agent cannot itself reroute – maximum 1 hop. This prevents infinite reroute loops when multiple agents in an equivalency group are failing simultaneously.
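One way to picture the guard is a hop counter carried with each request. The sketch below assumes that representation; `SketchRequest`, `rerouteHops`, and `tryReroute` are hypothetical names and do not describe how the library tracks reroutes internally.

```typescript
interface SketchRequest {
  agentId: string;
  rerouteHops: number; // how many times this request has already been rerouted
}

const MAX_REROUTE_HOPS = 1;

// Returns the rerouted request, or null when the hop limit has been reached.
function tryReroute(request: SketchRequest, replacementId: string): SketchRequest | null {
  if (request.rerouteHops >= MAX_REROUTE_HOPS) return null;
  return { agentId: replacementId, rerouteHops: request.rerouteHops + 1 };
}
```

With a limit of one hop, a request that moved from a failing agent to its replacement cannot move again, so two failing agents in the same group cannot bounce work between each other.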
Zero-Overhead Guarantee
All self-healing setup is gated behind configuration checks. When selfHealing is not provided:
- No HealthMonitor created
- No extra facts in the Directive system
- No extra constraints or resolvers
- The existing circuitBreaker option works exactly as before
Next Steps
- Multi-Agent Orchestrator – Setup and configuration
- Execution Patterns – Coordinating agent execution
- Resilience & Routing – Retry and fallback strategies
- DevTools – Visualize health metrics in the Health view
- Testing – assertAgentHealth and assertRerouted helpers

