Architecture: Resilience System
Infrastructure - Circuit Breaker and Result Caching
Overview
The resilience system protects against cascading failures and reduces latency:
- Circuit Breaker - Fail-fast pattern for external service failures
- Tool Result Cache - Scope-based caching for expensive operations
1. Circuit Breaker
Implements the circuit breaker pattern from Michael Nygard's "Release It!" to protect against cascading failures when external services (LLM, Memory, RAG) become unavailable.
State Machine
┌─────────────────────────────────────────────────────────────────┐
│                            CLOSED                                │
│                       (Normal operation)                         │
│                                                                  │
│  Requests flow through, failures tracked in sliding window       │
│                                                                  │
│  Transition → OPEN when: failures >= failure_threshold           │
└──────────────────────────────┬───────────────────────────────────┘
                               │ failure_threshold reached
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                             OPEN                                 │
│                        (Service failing)                         │
│                                                                  │
│  Requests rejected immediately (fail-fast)                       │
│  Raises CircuitOpenError with retry_after                        │
│                                                                  │
│  Transition → HALF_OPEN after: timeout_seconds elapsed           │
└──────────────────────────────┬───────────────────────────────────┘
                               │ timeout elapsed
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                           HALF_OPEN                              │
│                        (Testing recovery)                        │
│                                                                  │
│  Limited requests allowed to probe service health                │
│                                                                  │
│  Transition → CLOSED after: success_threshold consecutive        │
│                             successes                            │
│  Transition → OPEN on: any failure                               │
└──────────────────────────────────────────────────────────────────┘
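For readers who prefer code to diagrams, the sketch below expresses the same transition rules in plain Python. It is illustrative only: the class and its internals (_failures, _opened_at, _half_open_successes) are hypothetical and ignore details such as excluded_exceptions; the real implementation lives in resilience/circuit_breaker.py.

import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class BreakerSketch:
    # Illustrative state machine only; see resilience/circuit_breaker.py for the real one.
    def __init__(self, failure_threshold=5, success_threshold=2,
                 timeout_seconds=60.0, window_seconds=120.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_seconds = timeout_seconds
        self.window_seconds = window_seconds
        self.state = State.CLOSED
        self._failures = []             # failure timestamps inside the sliding window
        self._opened_at = 0.0
        self._half_open_successes = 0

    @property
    def is_available(self):
        # OPEN lazily becomes HALF_OPEN once timeout_seconds have elapsed
        if self.state is State.OPEN and time.time() - self._opened_at >= self.timeout_seconds:
            self.state = State.HALF_OPEN
            self._half_open_successes = 0
        return self.state is not State.OPEN

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self._half_open_successes += 1
            if self._half_open_successes >= self.success_threshold:
                self.state = State.CLOSED
                self._failures.clear()

    def record_failure(self, exc=None):
        now = time.time()
        if self.state is State.HALF_OPEN:
            # Any failure while probing re-opens the circuit immediately
            self.state = State.OPEN
            self._opened_at = now
            return
        self._failures = [t for t in self._failures if now - t <= self.window_seconds]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = now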
Configuration
from evennia.contrib.base_systems.ai.resilience import (
    CircuitBreaker,
    CircuitBreakerConfig,
    CircuitOpenError,
)

config = CircuitBreakerConfig(
    failure_threshold=5,       # Failures to trip OPEN
    success_threshold=2,       # Successes in HALF_OPEN to close
    timeout_seconds=60.0,      # Time in OPEN before HALF_OPEN
    window_seconds=120.0,      # Sliding window for failure tracking
    excluded_exceptions=(      # Business errors don't count
        ValueError,
        KeyError,
    ),
)
breaker = CircuitBreaker("llm_service", config)
Usage Pattern
from twisted.internet import defer
from twisted.internet.defer import inlineCallbacks

@inlineCallbacks
def call_external_service():
    if not breaker.is_available:
        # Graceful degradation
        defer.returnValue(fallback_value)
    try:
        result = yield external_service.call()
        breaker.record_success()
        defer.returnValue(result)
    except Exception as e:
        breaker.record_failure(e)
        raise
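When the circuit is OPEN, calls routed through the breaker surface as CircuitOpenError. Callers that do not pre-check is_available can catch it and degrade gracefully. A minimal sketch, where guarded_llm_call is a hypothetical helper that raises CircuitOpenError while the circuit is open, and retry_after is read defensively because its exact shape on the exception is not shown on this page:

from twisted.internet.defer import inlineCallbacks, returnValue

from evennia.contrib.base_systems.ai.resilience import CircuitOpenError

@inlineCallbacks
def respond_to_player(prompt):
    try:
        reply = yield guarded_llm_call(prompt)   # hypothetical helper; may raise CircuitOpenError
    except CircuitOpenError as err:
        # retry_after is described above as accompanying the error; read it defensively
        reply = f"(LLM unavailable; retry in ~{getattr(err, 'retry_after', '?')}s)"
    returnValue(reply)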
Statistics
stats = breaker.get_stats()
# {
# "name": "llm_service",
# "state": "closed",
# "failures_in_window": 2,
# "failure_threshold": 5,
# "total_failures": 15,
# "total_successes": 1230,
# "total_rejections": 42,
# "last_failure_error": "Connection timeout",
# "retry_after": None,
# "half_open_successes": 0,
# "success_threshold": 2,
# }
Registry
For multi-service management:
from evennia.contrib.base_systems.ai.resilience import CircuitBreakerRegistry
registry = CircuitBreakerRegistry()
# Get or create with defaults
llm_breaker = registry.get_or_create("llm")
memory_breaker = registry.get_or_create("memory", memory_config)
# Health check across all services
health = registry.get_all_stats()
# Reset all (manual recovery)
registry.reset_all()
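The aggregate stats lend themselves to a simple health report. A sketch, assuming get_all_stats() returns a mapping of breaker name to the per-breaker stats dict shown above (the exact return shape is an assumption here):

def report_unhealthy(registry):
    # Assumption: get_all_stats() maps breaker name -> stats dict as shown earlier.
    for name, stats in registry.get_all_stats().items():
        if stats["state"] != "closed":
            print(f"{name}: {stats['state']} "
                  f"({stats['failures_in_window']}/{stats['failure_threshold']} failures in window)")

report_unhealthy(registry)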
2. Tool Result Cache
In-memory caching with TTL and scope management for expensive read-only tool operations.
Cache Scopes
| Scope | Cleared | Use Case |
|---|---|---|
| tick | End of each tick | Location-dependent data, room inspection |
| session | On mode change | Stable semantic searches, memory queries |
Usage
from evennia.contrib.base_systems.ai.resilience import ToolResultCache, get_cache_key
cache = ToolResultCache()
# Generate deterministic key from tool + args
key = get_cache_key("inspect_location", room_id="abc123")
# → "inspect_location:a1b2c3d4e5f6"
# Check cache first
cached = cache.get(key)
if cached is not None:
    return cached
# Compute expensive result
result = expensive_operation()
# Cache with scope
cache.set(key, result, scope="tick")
# Or with explicit TTL
cache.set(key, result, scope="session", ttl=300.0) # 5 minutes
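The get-then-set pattern above can be folded into a small decorator so call sites do not repeat the boilerplate. This helper is not part of the contrib, just a sketch built on the cache and get_cache_key imported above:

import functools

def cached_tool_call(cache, scope="tick", ttl=None):
    # Hypothetical convenience wrapper; not part of the contrib.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            key = get_cache_key(func.__name__, **kwargs)
            hit = cache.get(key)
            if hit is not None:
                return hit
            result = func(**kwargs)
            if ttl is None:
                cache.set(key, result, scope=scope)
            else:
                cache.set(key, result, scope=scope, ttl=ttl)
            return result
        return wrapper
    return decorator

@cached_tool_call(cache, scope="tick")
def inspect_room(room_id):
    return expensive_operation()

inspect_room(room_id="abc123")   # a second call within the same tick returns the cached result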
Lifecycle Management
# At end of tick
cache.clear_tick()
# On mode change (awake → sleep)
cache.clear_session()
# Full clear
cache.clear_all()
# Periodic cleanup of TTL-expired entries
cache.prune_expired()
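In practice these calls belong in the agent's lifecycle hooks. A hedged sketch, assuming the cache lives on the agent script's ndb (as in the tool-execution snippet later on this page); AIAgentScript and on_mode_change are hypothetical names, while at_repeat() is the standard Evennia Script hook fired each interval:

from evennia import DefaultScript

class AIAgentScript(DefaultScript):            # hypothetical script class
    def at_repeat(self):
        # ... run one tick of the agent ...
        if self.ndb.tool_cache:
            self.ndb.tool_cache.clear_tick()
            self.ndb.tool_cache.prune_expired()

    def on_mode_change(self, new_mode):        # hypothetical hook (awake -> sleep, etc.)
        if self.ndb.tool_cache:
            self.ndb.tool_cache.clear_session()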
Statistics
stats = cache.get_stats()
# {
# "hits": 150,
# "misses": 45,
# "hit_rate_percent": 76.9,
# "total_entries": 12,
# "entries_by_scope": {"tick": 8, "session": 4},
# }
Integration with Tools
Tools declare cacheability:
class InspectLocationTool(Tool):
    name = "inspect_location"
    category = ToolCategory.SAFE_CHAIN
    cacheable = True        # Enables caching
    cache_ttl = 60.0        # TTL in seconds
    cache_scope = "tick"    # Scope for this tool
3. Integration Points
LLM Client Integration
The UnifiedLLMClient accepts a circuit breaker:
from evennia.contrib.base_systems.ai.llm import UnifiedLLMClient
from evennia.contrib.base_systems.ai.resilience import CircuitBreaker
breaker = CircuitBreaker("llm")
client = UnifiedLLMClient(
    provider="openai",
    circuit_breaker=breaker,
    ...
)
The client automatically:
- Checks breaker.is_available before requests
- Calls breaker.record_success() on 200 responses
- Calls breaker.record_failure() on retryable errors
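Outside the client, those three steps look roughly like this (a sketch reusing the breaker created above; llm_post is a hypothetical stand-in for the provider request, and the real client raises CircuitOpenError with retry_after rather than a plain RuntimeError):

from twisted.internet.defer import inlineCallbacks, returnValue

@inlineCallbacks
def send_prompt(payload):
    if not breaker.is_available:
        # The real client surfaces this as CircuitOpenError with retry_after
        raise RuntimeError("LLM circuit is open; retry later")
    try:
        response = yield llm_post(payload)     # hypothetical provider call
    except Exception as err:
        breaker.record_failure(err)            # retryable errors count toward the window
        raise
    breaker.record_success()                   # successful (200) responses feed recovery
    returnValue(response)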
Tool Execution Integration
In tool_execution.py:
# Check cache before execution
cache_key = get_cache_key(tool.name, **parameters)
cached = script.ndb.tool_cache.get(cache_key)
if cached is not None and tool.cacheable:
return cached
# Execute tool
result = yield tool.execute(...)
# Cache result
if tool.cacheable and result.get("success"):
script.ndb.tool_cache.set(
cache_key,
result,
scope=tool.cache_scope,
ttl=tool.cache_ttl
)
Key Files
| File | Lines | Purpose |
|---|---|---|
| resilience/circuit_breaker.py | 56-70 | CircuitBreakerState enum |
| resilience/circuit_breaker.py | 72-93 | CircuitBreakerConfig dataclass |
| resilience/circuit_breaker.py | 95-115 | CircuitOpenError exception |
| resilience/circuit_breaker.py | 117-332 | CircuitBreaker class |
| resilience/circuit_breaker.py | 334-420 | CircuitBreakerRegistry |
| resilience/caching.py | 35-58 | CacheEntry dataclass |
| resilience/caching.py | 60-227 | ToolResultCache class |
| resilience/caching.py | 229-252 | get_cache_key() helper |
See also: Architecture-LLM-Providers | Architecture-Tool-System | Data-Flow-08-LLM-Provider-Interaction