
Architecture: Resilience System

Infrastructure - Circuit Breaker and Result Caching


Overview

The resilience system protects against cascading failures and reduces latency:

  • Circuit Breaker - Fail-fast pattern for external service failures
  • Tool Result Cache - Scope-based caching for expensive operations

1. Circuit Breaker

The circuit breaker implements the pattern described in Michael Nygard's "Release It!", failing fast to protect against cascading failures when external services (LLM, Memory, RAG) become unavailable.

State Machine

┌─────────────────────────────────────────────────────────────────┐
│                         CLOSED                                   │
│                    (Normal operation)                            │
│                                                                  │
│   Requests flow through, failures tracked in sliding window      │
│                                                                  │
│   Transition → OPEN when: failures >= failure_threshold          │
└──────────────────────────────┬──────────────────────────────────┘
                               │ failure_threshold reached
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                          OPEN                                    │
│                    (Service failing)                             │
│                                                                  │
│   Requests rejected immediately (fail-fast)                      │
│   Raises CircuitOpenError with retry_after                       │
│                                                                  │
│   Transition → HALF_OPEN after: timeout_seconds elapsed          │
└──────────────────────────────┬──────────────────────────────────┘
                               │ timeout elapsed
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                        HALF_OPEN                                 │
│                    (Testing recovery)                            │
│                                                                  │
│   Limited requests allowed to probe service health               │
│                                                                  │
│   Transition → CLOSED after: success_threshold successes         │
│   Transition → OPEN on: any failure                              │
└─────────────────────────────────────────────────────────────────┘
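
The transitions can be illustrated with a minimal, self-contained sketch. This is illustrative only and not the shipped CircuitBreaker implementation; the attribute names simply mirror the CircuitBreakerConfig fields shown in the next section.

import time
from collections import deque

class SketchBreaker:
    """Illustrative sketch of the state machine above, not the real CircuitBreaker."""

    def __init__(self, failure_threshold=5, success_threshold=2,
                 timeout_seconds=60.0, window_seconds=120.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_seconds = timeout_seconds
        self.window_seconds = window_seconds
        self.state = "closed"
        self._failures = deque()          # timestamps of recent failures
        self._opened_at = 0.0
        self._half_open_successes = 0

    @property
    def is_available(self):
        # OPEN -> HALF_OPEN once timeout_seconds have elapsed
        if self.state == "open" and time.monotonic() - self._opened_at >= self.timeout_seconds:
            self.state = "half_open"
            self._half_open_successes = 0
        return self.state != "open"

    def record_success(self):
        # HALF_OPEN -> CLOSED after success_threshold successes
        if self.state == "half_open":
            self._half_open_successes += 1
            if self._half_open_successes >= self.success_threshold:
                self.state = "closed"
                self._failures.clear()

    def record_failure(self, exc=None):
        now = time.monotonic()
        if self.state == "half_open":
            # Any failure while probing reopens the circuit
            self.state = "open"
            self._opened_at = now
            return
        # Track failures inside the sliding window; trip OPEN at the threshold
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self.state = "open"
            self._opened_at = now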

Configuration

from evennia.contrib.base_systems.ai.resilience import (
    CircuitBreaker,
    CircuitBreakerConfig,
    CircuitOpenError,
)

config = CircuitBreakerConfig(
    failure_threshold=5,       # Failures to trip OPEN
    success_threshold=2,       # Successes in HALF_OPEN to close
    timeout_seconds=60.0,      # Time in OPEN before HALF_OPEN
    window_seconds=120.0,      # Sliding window for failure tracking
    excluded_exceptions=(      # Business errors don't count
        ValueError,
        KeyError,
    ),
)

breaker = CircuitBreaker("llm_service", config)

Usage Pattern

from twisted.internet import defer
from twisted.internet.defer import inlineCallbacks

# `breaker`, `external_service` and `fallback_value` come from the caller's context.
@inlineCallbacks
def call_external_service():
    if not breaker.is_available:
        # Graceful degradation: skip the call entirely while the circuit is OPEN
        defer.returnValue(fallback_value)

    try:
        result = yield external_service.call()
        breaker.record_success()
        defer.returnValue(result)
    except Exception as e:
        # excluded_exceptions from the config (business errors) do not count
        breaker.record_failure(e)
        raise
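
Alternatively, a caller can let the breaker-protected layer raise and handle the rejection explicitly. A minimal sketch, assuming the protected call (here the hypothetical protected_llm_call()) raises the CircuitOpenError described in the state diagram, with retry_after available as an attribute:

import logging

from twisted.internet import defer
from twisted.internet.defer import inlineCallbacks

from evennia.contrib.base_systems.ai.resilience import CircuitOpenError

log = logging.getLogger(__name__)

@inlineCallbacks
def call_with_explicit_handling():
    try:
        # protected_llm_call() is a placeholder for any breaker-guarded request
        result = yield protected_llm_call()
    except CircuitOpenError as err:
        # Assumption: retry_after mirrors the value exposed in get_stats()
        log.warning("Circuit open; retry in %ss, using fallback", err.retry_after)
        defer.returnValue(fallback_value)
    defer.returnValue(result)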

Statistics

stats = breaker.get_stats()
# {
#     "name": "llm_service",
#     "state": "closed",
#     "failures_in_window": 2,
#     "failure_threshold": 5,
#     "total_failures": 15,
#     "total_successes": 1230,
#     "total_rejections": 42,
#     "last_failure_error": "Connection timeout",
#     "retry_after": None,
#     "half_open_successes": 0,
#     "success_threshold": 2,
# }

Registry

For multi-service management:

from evennia.contrib.base_systems.ai.resilience import CircuitBreakerRegistry

registry = CircuitBreakerRegistry()

# Get or create with defaults
llm_breaker = registry.get_or_create("llm")
memory_breaker = registry.get_or_create("memory", memory_config)

# Health check across all services
health = registry.get_all_stats()

# Reset all (manual recovery)
registry.reset_all()
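
A registry-wide health check reduces to a small helper. A sketch, assuming get_all_stats() returns a mapping of breaker name to the per-breaker stats dict shown above:

import logging

log = logging.getLogger(__name__)

def report_unhealthy_breakers(registry):
    """Log every breaker that is not currently CLOSED."""
    for name, stats in registry.get_all_stats().items():
        if stats["state"] != "closed":
            log.warning(
                "circuit '%s' is %s (retry_after=%s, failures_in_window=%s)",
                name,
                stats["state"],
                stats.get("retry_after"),
                stats.get("failures_in_window"),
            )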

2. Tool Result Cache

In-memory caching with TTL and scope management for expensive read-only tool operations.

Cache Scopes

Scope     Cleared             Use Case
tick      End of each tick    Location-dependent data, room inspection
session   On mode change      Stable semantic searches, memory queries

Usage

from evennia.contrib.base_systems.ai.resilience import ToolResultCache, get_cache_key

cache = ToolResultCache()

# Generate deterministic key from tool + args
key = get_cache_key("inspect_location", room_id="abc123")
# → "inspect_location:a1b2c3d4e5f6"

# Check cache first
cached = cache.get(key)
if cached is not None:
    return cached

# Compute expensive result
result = expensive_operation()

# Cache with scope
cache.set(key, result, scope="tick")
# Or with explicit TTL
cache.set(key, result, scope="session", ttl=300.0)  # 5 minutes

Lifecycle Management

# At end of tick
cache.clear_tick()

# On mode change (awake → sleep)
cache.clear_session()

# Full clear
cache.clear_all()

# Periodic cleanup of TTL-expired entries
cache.prune_expired()
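
These calls are meant to be wired into the owning script's lifecycle. A sketch of where they would typically sit; the hook names below are hypothetical, not the actual script API:

class CacheLifecycleMixin:
    """Hypothetical wiring of cache lifecycle calls into a tick/mode cycle."""

    def at_tick_end(self):
        # Tick-scoped entries (location-dependent data) expire with the tick
        self.ndb.tool_cache.clear_tick()

    def at_mode_change(self, old_mode, new_mode):
        # Session-scoped entries reset when the NPC changes mode (awake -> sleep)
        self.ndb.tool_cache.clear_session()

    def at_maintenance(self):
        # Periodically drop TTL-expired entries regardless of scope
        self.ndb.tool_cache.prune_expired()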

Statistics

stats = cache.get_stats()
# {
#     "hits": 150,
#     "misses": 45,
#     "hit_rate_percent": 76.9,
#     "total_entries": 12,
#     "entries_by_scope": {"tick": 8, "session": 4},
# }

Integration with Tools

Tools declare cacheability:

class InspectLocationTool(Tool):
    name = "inspect_location"
    category = ToolCategory.SAFE_CHAIN
    cacheable = True           # Enables caching
    cache_ttl = 60.0           # TTL in seconds
    cache_scope = "tick"       # Scope for this tool
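
A session-scoped counterpart, matching the "Stable semantic searches" row in the scope table above (the tool shown here is hypothetical):

class SemanticSearchTool(Tool):
    name = "semantic_search"
    category = ToolCategory.SAFE_CHAIN
    cacheable = True
    cache_ttl = 300.0          # results stay valid longer than one tick
    cache_scope = "session"    # cleared on mode change, not every tick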

3. Integration Points

LLM Client Integration

The UnifiedLLMClient accepts a circuit breaker:

from evennia.contrib.base_systems.ai.llm import UnifiedLLMClient
from evennia.contrib.base_systems.ai.resilience import CircuitBreaker

breaker = CircuitBreaker("llm")
client = UnifiedLLMClient(
    provider="openai",
    circuit_breaker=breaker,
    ...
)

The client automatically:

  • Checks breaker.is_available before requests
  • Calls breaker.record_success() on 200 responses
  • Calls breaker.record_failure() on retryable errors

Tool Execution Integration

In tool_execution.py:

# Check cache before execution
cache_key = get_cache_key(tool.name, **parameters)
cached = script.ndb.tool_cache.get(cache_key)
if cached is not None and tool.cacheable:
    return cached

# Execute tool
result = yield tool.execute(...)

# Cache result
if tool.cacheable and result.get("success"):
    script.ndb.tool_cache.set(
        cache_key,
        result,
        scope=tool.cache_scope,
        ttl=tool.cache_ttl
    )

Key Files

File                           Lines     Purpose
resilience/circuit_breaker.py  56-70     CircuitBreakerState enum
resilience/circuit_breaker.py  72-93     CircuitBreakerConfig dataclass
resilience/circuit_breaker.py  95-115    CircuitOpenError exception
resilience/circuit_breaker.py  117-332   CircuitBreaker class
resilience/circuit_breaker.py  334-420   CircuitBreakerRegistry
resilience/caching.py          35-58     CacheEntry dataclass
resilience/caching.py          60-227    ToolResultCache class
resilience/caching.py          229-252   get_cache_key() helper

See also: Architecture-LLM-Providers | Architecture-Tool-System | Data-Flow-08-LLM-Provider-Interaction