
Architecture: LLM Providers

Layer 2 - Unified LLM Client and Provider Abstraction


Overview

The LLM provider layer exposes a Twisted-native, provider-agnostic interface for language model API calls:

  • UnifiedLLMClient - Single interface for all providers
  • LLMProvider ABC - Provider-specific request/response formatting
  • Response models - Standardized output with token usage and rate limits
  • Retry logic - Exponential backoff with jitter
  • Circuit breaker integration - Fault isolation (see Architecture-Resilience-System)

1. UnifiedLLMClient

The main interface for LLM API calls (llm/client.py).

Initialization

from evennia.contrib.base_systems.ai.llm import UnifiedLLMClient

client = UnifiedLLMClient(
    provider="openai",           # "openai" | "anthropic" | "openrouter" | "ollama" | "local"
    auth_token="sk-...",         # API key
    model="gpt-4",               # Model name
    retry_config=RetryConfig(),  # Optional retry settings
    circuit_breaker=breaker,     # Optional CircuitBreaker
    debug=False,                 # Enable debug logging
    # Provider-specific options:
    app_name="My App",           # OpenRouter X-Title header
    site_url="https://...",      # OpenRouter HTTP-Referer
)

Chat Completion

from twisted.internet.defer import inlineCallbacks

@inlineCallbacks
def example():
    response = yield client.chat_completion(
        messages=[
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": "Hello!"}
        ],
        tools=[...],             # Optional tool schemas
        tool_choice="auto",      # "auto" | "none" | specific tool
        temperature=0.7,         # Provider-specific kwargs
        max_tokens=1000,
    )

    print(response.content)
    print(response.tool_calls)
    print(response.usage.total_tokens)

Key Properties

| Property | Type | Description |
|---|---|---|
| provider_name | str | Current provider name |
| model | str | Current model name |
| supports_tools | bool | Whether provider supports tool calling |
| rate_limit_info | RateLimitInfo | Rate limits from last request |
| circuit_breaker | CircuitBreaker | Circuit breaker if configured |
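
For example, a caller might branch on these before offering tools (a minimal sketch, reusing the client constructed in the initialization example above):

# Sketch only: assumes `client` is the UnifiedLLMClient built earlier.
if client.supports_tools:
    print(f"{client.provider_name}/{client.model} supports tool calling")
else:
    print(f"{client.provider_name}/{client.model}: falling back to plain chat")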

Token Counting

# Uses provider-specific tokenizer (tiktoken for OpenAI, estimation for others)
tokens = client.count_tokens("Some text to count")

Tool Result Formatting

# Format tool result message for current provider
tool_msg = client.format_tool_result(tool_call_id="call_123", result={"status": "ok"})
# Returns provider-specific format:
# OpenAI:    {"role": "tool", "tool_call_id": "...", "content": "..."}
# Anthropic: {"role": "user", "content": [{"type": "tool_result", ...}]}
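
A hedged sketch of the usual round trip: run each requested tool, append the formatted results to the message history, and call chat_completion again. Here run_tool is a hypothetical dispatcher, and the bookkeeping for echoing the assistant's own tool-call turn back into the history is provider-specific and omitted:

from twisted.internet.defer import inlineCallbacks

@inlineCallbacks
def answer_with_tools(client, messages, tools):
    response = yield client.chat_completion(messages=messages, tools=tools)
    if response.has_tool_calls:
        for call in response.tool_calls:
            result = run_tool(call.name, call.arguments)   # hypothetical dispatcher
            messages.append(
                client.format_tool_result(tool_call_id=call.id, result=result)
            )
        # Second pass with the tool results included in the history.
        response = yield client.chat_completion(messages=messages, tools=tools)
    return response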

2. LLMProvider ABC

Abstract base class for provider implementations (llm/providers/base.py).

Required Methods

| Method | Purpose |
|---|---|
| build_headers(auth_token) | Build Twisted-compatible request headers |
| build_request(messages, tools, **kwargs) | Build provider-specific request body |
| parse_response(response_json, headers) | Parse response into LLMResponse |
| count_tokens(text) | Count tokens using provider's tokenizer |
| format_tool_result(tool_call_id, result) | Format tool result message |

Provider Metadata

class MyProvider(LLMProvider):
    name = "myprovider"
    supports_tools = True      # Tool calling support
    supports_streaming = True  # Streaming support
    supports_chat = True       # Chat format support

Optional Methods

| Method | Default Behavior |
|---|---|
| get_api_url() | Returns self.api_url |
| parse_rate_limit_headers(headers) | Creates RateLimitInfo from standard headers |
| parse_error_response(response_json) | Extracts error message from response |
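
A skeletal custom provider following the signatures above. This is a sketch, not a drop-in implementation: the import paths, header format, and request/response shapes are assumptions, and a real provider would also handle model selection, tools, and error parsing:

from evennia.contrib.base_systems.ai.llm.providers.base import LLMProvider   # assumed path
from evennia.contrib.base_systems.ai.llm.responses import LLMResponse        # assumed path

class EchoProvider(LLMProvider):
    name = "echo"
    supports_tools = False
    supports_streaming = False
    supports_chat = True
    api_url = "http://localhost:9000/chat"   # placeholder endpoint

    def build_headers(self, auth_token):
        # Twisted's Headers class expects each value as a list.
        return {b"Content-Type": [b"application/json"],
                b"Authorization": [f"Bearer {auth_token}".encode()]}

    def build_request(self, messages, tools=None, **kwargs):
        # Model selection and provider-specific options omitted for brevity.
        return {"messages": messages, **kwargs}

    def parse_response(self, response_json, headers):
        return LLMResponse(
            content=response_json.get("text", ""),
            provider=self.name,
            raw_response=response_json,
        )

    def count_tokens(self, text):
        return max(1, len(text) // 4)   # rough ~4 chars/token estimate

    def format_tool_result(self, tool_call_id, result):
        return {"role": "tool", "tool_call_id": tool_call_id, "content": str(result)}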

3. Available Providers

The get_provider() factory function creates provider instances by name.

| Provider | Name | Default Model | Tool Support |
|---|---|---|---|
| OpenAI | "openai" | gpt-4 | Yes |
| Anthropic | "anthropic" | claude-3-5-sonnet-20241022 | Yes |
| OpenRouter | "openrouter" | openai/gpt-4o-mini | Yes |
| Ollama | "ollama" | llama3.2 | Varies |
| Local | "local" | None | No |

OpenAI Provider

Supports the OpenAI API and OpenAI-compatible endpoints (Azure OpenAI, etc.).

# Standard OpenAI
client = UnifiedLLMClient(provider="openai", auth_token="sk-...", model="gpt-4")

# Azure OpenAI (custom endpoint)
client = UnifiedLLMClient(
    provider="openai",
    api_url="https://myinstance.openai.azure.com/...",
    auth_token="...",
    model="gpt-4"
)

Anthropic Provider

Native Claude API support with prompt caching.

client = UnifiedLLMClient(
    provider="anthropic",
    auth_token="sk-ant-...",
    model="claude-3-5-sonnet-20241022"
)

OpenRouter Provider

Multi-provider gateway with attribution headers.

client = UnifiedLLMClient(
    provider="openrouter",
    auth_token="sk-or-...",
    model="anthropic/claude-3.5-sonnet",
    app_name="My Evennia Game",      # X-Title header
    site_url="https://mygame.com",   # HTTP-Referer header
)

Local Providers

For local model servers (Ollama, text-generation-webui).

# Ollama
client = UnifiedLLMClient(
    provider="ollama",
    api_url="http://localhost:11434/api/chat",
    model="llama3.2"
)

# Generic local (OpenAI-compatible)
client = UnifiedLLMClient(
    provider="local",
    api_url="http://localhost:5000/v1/chat/completions",
    model="my-model"
)

4. Response Models

Standardized response structures (llm/responses.py).

LLMResponse

@dataclass
class LLMResponse:
    content: str = ""                        # Text content
    tool_calls: List[ToolCall] = field(default_factory=list)  # Parsed tool calls
    finish_reason: str = ""                  # "stop" | "tool_calls" | "error" | ...
    usage: Optional[TokenUsage] = None       # Token statistics
    rate_limit_info: Optional[RateLimitInfo] = None
    model: str = ""                          # Model used
    provider: str = ""                       # Provider name
    raw_response: Optional[Dict] = None      # Original response

    @property
    def has_tool_calls(self) -> bool:
        return len(self.tool_calls) > 0

ToolCall

@dataclass
class ToolCall:
    id: str                    # Tool call ID for result pairing
    name: str                  # Tool name
    arguments: Dict[str, Any]  # Parsed arguments

TokenUsage

@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    # Cache statistics (provider-specific)
    cache_creation_tokens: int = 0   # Anthropic: tokens written to cache
    cache_read_tokens: int = 0       # Anthropic: tokens from cache
    cached_tokens: int = 0           # OpenAI: automatic caching

    @property
    def cache_hit_tokens(self) -> int:
        """Total tokens served from cache (works for both providers)."""
        return self.cache_read_tokens + self.cached_tokens
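
For example, a caller can log cache effectiveness after each call (a sketch; usage may be None if the provider returned no statistics):

usage = response.usage
if usage is not None:
    print(f"prompt={usage.prompt_tokens} completion={usage.completion_tokens} "
          f"cache_hits={usage.cache_hit_tokens}")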

RateLimitInfo

@dataclass
class RateLimitInfo:
    remaining_tokens: Optional[int] = None
    remaining_requests: Optional[int] = None
    reset_tokens: Optional[str] = None
    reset_requests: Optional[str] = None
    limit_tokens: Optional[int] = None      # OpenRouter
    limit_requests: Optional[int] = None    # OpenRouter

    def is_approaching_limit(self, token_threshold=1000, request_threshold=10) -> bool:
        """Check if approaching rate limits."""

5. Retry Configuration

Exponential backoff with the "Full Jitter" algorithm (llm/responses.py).

@dataclass
class RetryConfig:
    max_attempts: int = 3
    backoff_base: float = 1.0          # Initial delay (seconds)
    backoff_max: float = 30.0          # Maximum delay
    backoff_multiplier: float = 2.0    # Exponential factor
    jitter: bool = True                # Random jitter (prevents thundering herd)
    transient_status_codes: Set[int] = field(default_factory=lambda: {429, 500, 502, 503, 504})

Delay Calculation

With jitter enabled (default):
  delay = random(0, min(backoff_max, backoff_base * backoff_multiplier^attempt))

Without jitter:
  delay = min(backoff_max, backoff_base * backoff_multiplier^attempt)
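
The same calculation as a standalone Python sketch (illustrative; not the module's actual helper):

import random

def retry_delay(attempt, config):
    """Seconds to wait before retry number `attempt` (0-based)."""
    ceiling = min(config.backoff_max,
                  config.backoff_base * config.backoff_multiplier ** attempt)
    return random.uniform(0, ceiling) if config.jitter else ceiling

With the defaults this gives ceilings of 1, 2, and 4 seconds for attempts 0-2, capped at 30 seconds.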

Retryable Errors

Only these HTTP status codes trigger a retry:

  • 429 - Rate limited
  • 500 - Internal server error
  • 502 - Bad gateway
  • 503 - Service unavailable
  • 504 - Gateway timeout

6. Request Flow

client.chat_completion(messages, tools, **kwargs)
    │
    ▼
┌──────────────────────────────────────────────────────────────────┐
│  UnifiedLLMClient._request_with_retry()                          │
├──────────────────────────────────────────────────────────────────┤
│  1. Check circuit breaker                                        │
│       if breaker.is_available == False:                          │
│           raise CircuitOpenError                                 │
│                                                                  │
│  2. For attempt in range(max_attempts):                          │
│       status, body, headers = _make_request(body)                │
│                                                                  │
│       if status == 200:                                          │
│           breaker.record_success()                               │
│           return (status, body, headers)                         │
│                                                                  │
│       if is_retryable(status):                                   │
│           breaker.record_failure()                               │
│           delay = get_delay(attempt)  # with jitter              │
│           yield deferLater(delay)                                │
│                                                                  │
│  3. All retries exhausted → return error response                │
└──────────────────────────────────────────────────────────────────┘
    │
    ▼
client._parse_response(status_code, response_bytes, headers)
    │
    ├── Update rate_limit_info from headers
    ├── Log warning if approaching limits
    ├── Handle error responses (status != 200)
    └── Parse successful response via provider.parse_response()
    │
    ▼
LLMResponse (standardized output)
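
The delay in step 2 is non-blocking: the client yields a deferLater instead of sleeping, so the reactor keeps serving other traffic. A simplified sketch of the loop (illustrative only; the real implementation also records circuit breaker outcomes and parses rate limit headers):

import random

from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import deferLater

@inlineCallbacks
def request_with_retry(make_request, config):
    """make_request is any callable returning a Deferred of (status, body, headers)."""
    for attempt in range(config.max_attempts):
        status, body, headers = yield make_request()
        if status == 200 or status not in config.transient_status_codes:
            return (status, body, headers)
        # Transient failure: wait with full jitter without blocking the reactor.
        delay = min(config.backoff_max,
                    config.backoff_base * config.backoff_multiplier ** attempt)
        if config.jitter:
            delay = random.uniform(0, delay)
        yield deferLater(reactor, delay, lambda: None)
    return (status, body, headers)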

Key Files

| File | Lines | Purpose |
|---|---|---|
| llm/client.py | 1-432 | UnifiedLLMClient implementation |
| llm/providers/base.py | 1-174 | LLMProvider ABC |
| llm/providers/__init__.py | 1-70 | get_provider() factory |
| llm/providers/openai.py | 1-219 | OpenAI/compatible provider |
| llm/providers/anthropic.py | | Anthropic Claude provider |
| llm/providers/openrouter.py | | OpenRouter provider |
| llm/providers/local.py | | Ollama and local server providers |
| llm/responses.py | 1-170 | Response dataclasses |

See also: Architecture-Resilience-System | Architecture-Core-Engine | Data-Flow-08-LLM-Provider-Interaction