
Architecture: Resilience System

Infrastructure - Circuit Breaker and Result Caching


Overview

The resilience system protects against cascading failures and reduces latency:

  • Circuit Breaker - Fail-fast pattern for external service failures
  • Tool Result Cache - Scope-based caching for expensive operations

1. Circuit Breaker

The circuit breaker implements the pattern described in Michael Nygard's "Release It!", failing fast to protect against cascading failures when external services (LLM, Memory, RAG) become unavailable.

State Machine

┌─────────────────────────────────────────────────────────────────┐
│                         CLOSED                                   │
│                    (Normal operation)                            │
│                                                                  │
│   Requests flow through, failures tracked in sliding window      │
│                                                                  │
│   Transition → OPEN when: failures >= failure_threshold          │
└──────────────────────────────┬──────────────────────────────────┘
                               │ failure_threshold reached
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                          OPEN                                    │
│                    (Service failing)                             │
│                                                                  │
│   Requests rejected immediately (fail-fast)                      │
│   Raises CircuitOpenError with retry_after                       │
│                                                                  │
│   Transition → HALF_OPEN after: timeout_seconds elapsed          │
└──────────────────────────────┬──────────────────────────────────┘
                               │ timeout elapsed
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                        HALF_OPEN                                 │
│                    (Testing recovery)                            │
│                                                                  │
│   Limited requests allowed to probe service health               │
│                                                                  │
│   Transition → CLOSED after: success_threshold successes         │
│   Transition → OPEN on: any failure                              │
└─────────────────────────────────────────────────────────────────┘
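
The transitions can be illustrated with a minimal, self-contained sketch. This is illustrative only and not the shipped CircuitBreaker implementation; the attribute names simply mirror the CircuitBreakerConfig fields shown in the next section.

import time
from collections import deque

class SketchBreaker:
    """Illustrative sketch of the state machine above, not the real CircuitBreaker."""

    def __init__(self, failure_threshold=5, success_threshold=2,
                 timeout_seconds=60.0, window_seconds=120.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_seconds = timeout_seconds
        self.window_seconds = window_seconds
        self.state = "closed"
        self._failures = deque()          # timestamps of recent failures
        self._opened_at = 0.0
        self._half_open_successes = 0

    @property
    def is_available(self):
        # OPEN -> HALF_OPEN once timeout_seconds have elapsed
        if self.state == "open" and time.monotonic() - self._opened_at >= self.timeout_seconds:
            self.state = "half_open"
            self._half_open_successes = 0
        return self.state != "open"

    def record_success(self):
        # HALF_OPEN -> CLOSED after success_threshold successes
        if self.state == "half_open":
            self._half_open_successes += 1
            if self._half_open_successes >= self.success_threshold:
                self.state = "closed"
                self._failures.clear()

    def record_failure(self, exc=None):
        now = time.monotonic()
        if self.state == "half_open":
            # Any failure while probing reopens the circuit
            self.state = "open"
            self._opened_at = now
            return
        # Track failures inside the sliding window; trip OPEN at the threshold
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self.state = "open"
            self._opened_at = now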

Configuration

from evennia.contrib.base_systems.ai.resilience import (
    CircuitBreaker,
    CircuitBreakerConfig,
    CircuitOpenError,
)

config = CircuitBreakerConfig(
    failure_threshold=5,       # Failures to trip OPEN
    success_threshold=2,       # Successes in HALF_OPEN to close
    timeout_seconds=60.0,      # Time in OPEN before HALF_OPEN
    window_seconds=120.0,      # Sliding window for failure tracking
    excluded_exceptions=(      # Business errors don't count
        ValueError,
        KeyError,
    ),
)

breaker = CircuitBreaker("llm_service", config)

Usage Pattern

from twisted.internet import defer
from twisted.internet.defer import inlineCallbacks

# `breaker`, `external_service` and `fallback_value` come from the caller's context.
@inlineCallbacks
def call_external_service():
    if not breaker.is_available:
        # Graceful degradation: skip the call entirely while the circuit is OPEN
        defer.returnValue(fallback_value)

    try:
        result = yield external_service.call()
        breaker.record_success()
        defer.returnValue(result)
    except Exception as e:
        # excluded_exceptions from the config (business errors) do not count
        breaker.record_failure(e)
        raise
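
Alternatively, a caller can let the breaker-protected layer raise and handle the rejection explicitly. A minimal sketch, assuming the protected call (here the hypothetical protected_llm_call()) raises the CircuitOpenError described in the state diagram, with retry_after available as an attribute:

import logging

from twisted.internet import defer
from twisted.internet.defer import inlineCallbacks

from evennia.contrib.base_systems.ai.resilience import CircuitOpenError

log = logging.getLogger(__name__)

@inlineCallbacks
def call_with_explicit_handling():
    try:
        # protected_llm_call() is a placeholder for any breaker-guarded request
        result = yield protected_llm_call()
    except CircuitOpenError as err:
        # Assumption: retry_after mirrors the value exposed in get_stats()
        log.warning("Circuit open; retry in %ss, using fallback", err.retry_after)
        defer.returnValue(fallback_value)
    defer.returnValue(result)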

Statistics

stats = breaker.get_stats()
# {
#     "name": "llm_service",
#     "state": "closed",
#     "failures_in_window": 2,
#     "failure_threshold": 5,
#     "total_failures": 15,
#     "total_successes": 1230,
#     "total_rejections": 42,
#     "last_failure_error": "Connection timeout",
#     "retry_after": None,
#     "half_open_successes": 0,
#     "success_threshold": 2,
# }

Registry

For multi-service management:

from evennia.contrib.base_systems.ai.resilience import CircuitBreakerRegistry

registry = CircuitBreakerRegistry()

# Get or create with defaults
llm_breaker = registry.get_or_create("llm")
memory_breaker = registry.get_or_create("memory", memory_config)

# Health check across all services
health = registry.get_all_stats()

# Reset all (manual recovery)
registry.reset_all()
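
A registry-wide health check reduces to a small helper. A sketch, assuming get_all_stats() returns a mapping of breaker name to the per-breaker stats dict shown above:

import logging

log = logging.getLogger(__name__)

def report_unhealthy_breakers(registry):
    """Log every breaker that is not currently CLOSED."""
    for name, stats in registry.get_all_stats().items():
        if stats["state"] != "closed":
            log.warning(
                "circuit '%s' is %s (retry_after=%s, failures_in_window=%s)",
                name,
                stats["state"],
                stats.get("retry_after"),
                stats.get("failures_in_window"),
            )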

2. Tool Result Cache

In-memory caching with TTL and scope management for expensive read-only tool operations.

Cache Scopes

Scope     Cleared             Use Case
tick      End of each tick    Location-dependent data, room inspection
session   On mode change      Stable semantic searches, memory queries

Usage

from evennia.contrib.base_systems.ai.resilience import ToolResultCache, get_cache_key

cache = ToolResultCache()

# Generate deterministic key from tool + args
key = get_cache_key("inspect_location", room_id="abc123")
# → "inspect_location:a1b2c3d4e5f6"

# Check cache first
cached = cache.get(key)
if cached is not None:
    return cached

# Compute expensive result
result = expensive_operation()

# Cache with scope
cache.set(key, result, scope="tick")
# Or with explicit TTL
cache.set(key, result, scope="session", ttl=300.0)  # 5 minutes

Lifecycle Management

# At end of tick
cache.clear_tick()

# On mode change (awake → sleep)
cache.clear_session()

# Full clear
cache.clear_all()

# Periodic cleanup of TTL-expired entries
cache.prune_expired()
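
These calls are meant to be wired into the owning script's lifecycle. A sketch of where they would typically sit; the hook names below are hypothetical, not the actual script API:

class CacheLifecycleMixin:
    """Hypothetical wiring of cache lifecycle calls into a tick/mode cycle."""

    def at_tick_end(self):
        # Tick-scoped entries (location-dependent data) expire with the tick
        self.ndb.tool_cache.clear_tick()

    def at_mode_change(self, old_mode, new_mode):
        # Session-scoped entries reset when the NPC changes mode (awake -> sleep)
        self.ndb.tool_cache.clear_session()

    def at_maintenance(self):
        # Periodically drop TTL-expired entries regardless of scope
        self.ndb.tool_cache.prune_expired()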

Statistics

stats = cache.get_stats()
# {
#     "hits": 150,
#     "misses": 45,
#     "hit_rate_percent": 76.9,
#     "total_entries": 12,
#     "entries_by_scope": {"tick": 8, "session": 4},
# }

Integration with Tools

Tools declare cacheability:

class InspectLocationTool(Tool):
    name = "inspect_location"
    category = ToolCategory.SAFE_CHAIN
    cacheable = True           # Enables caching
    cache_ttl = 60.0           # TTL in seconds
    cache_scope = "tick"       # Scope for this tool
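
A session-scoped counterpart, matching the "Stable semantic searches" row in the scope table above (the tool shown here is hypothetical):

class SemanticSearchTool(Tool):
    name = "semantic_search"
    category = ToolCategory.SAFE_CHAIN
    cacheable = True
    cache_ttl = 300.0          # results stay valid longer than one tick
    cache_scope = "session"    # cleared on mode change, not every tick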

3. Integration Points

LLM Client Integration

The UnifiedLLMClient accepts a circuit breaker:

from evennia.contrib.base_systems.ai.llm import UnifiedLLMClient
from evennia.contrib.base_systems.ai.resilience import CircuitBreaker

breaker = CircuitBreaker("llm")
client = UnifiedLLMClient(
    provider="openai",
    circuit_breaker=breaker,
    ...
)

The client automatically:

  • Checks breaker.is_available before requests
  • Calls breaker.record_success() on 200 responses
  • Calls breaker.record_failure() on retryable errors

Tool Execution Integration

In tool_execution.py:

# Check cache before execution
cache_key = get_cache_key(tool.name, **parameters)
cached = script.ndb.tool_cache.get(cache_key)
if cached is not None and tool.cacheable:
    return cached

# Execute tool
result = yield tool.execute(...)

# Cache result
if tool.cacheable and result.get("success"):
    script.ndb.tool_cache.set(
        cache_key,
        result,
        scope=tool.cache_scope,
        ttl=tool.cache_ttl
    )

Key Files

File                           Lines     Purpose
resilience/circuit_breaker.py  56-70     CircuitBreakerState enum
resilience/circuit_breaker.py  72-93     CircuitBreakerConfig dataclass
resilience/circuit_breaker.py  95-115    CircuitOpenError exception
resilience/circuit_breaker.py  117-332   CircuitBreaker class
resilience/circuit_breaker.py  334-420   CircuitBreakerRegistry
resilience/caching.py          35-58     CacheEntry dataclass
resilience/caching.py          60-227    ToolResultCache class
resilience/caching.py          229-252   get_cache_key() helper

See also: Architecture-LLM-Providers | Architecture-Tool-System | Data-Flow-08-LLM-Provider-Interaction