LLMs · AI · Python · Backend · Production

LLM Engineering in Production: What Nobody Tells You Until It Breaks

8 May 2025 · 10 min read · Harshit Gupta
TL;DR

LLMs are probabilistic systems running on top of deterministic infrastructure. The mistakes are different. Key lessons: always validate LLM output structure (don't trust JSON), cache aggressively (same prompt = same response 95% of the time), handle rate limits and context window limits gracefully, and build observability from day one — you cannot debug what you cannot see.

The Illusion of Simplicity

The first LLM feature I shipped at CertifyMe took two days: an AI-powered credential description generator. The demo was flawless. The week after launch, it started silently returning empty descriptions for 8% of users. Support tickets piled up. We had no idea why.

The cause? The LLM was occasionally returning a JSON response with the description field spelled descripton (a typo in its output). Our code did response['description'], got a KeyError, the exception handler swallowed it silently, and users got a blank field. Classic. But classic in a way that normal software bugs aren't — because it was non-deterministic, happened 8% of the time, and was invisible without structured logging.

That incident taught me the most important lesson in LLM engineering: LLMs are probabilistic. Your system around them must be defensive.

Lesson 1: Validate LLM Output Structure — Always

Never trust that an LLM will return the exact format you asked for. Even with explicit JSON instructions, models occasionally produce prose, markdown-wrapped JSON, or structurally valid JSON with missing fields. Use Pydantic for structural validation:

from pydantic import BaseModel, ValidationError
from openai import OpenAI
import json
import logging

logger = logging.getLogger(__name__)

class CredentialDescription(BaseModel):
    short_description: str
    highlights: list[str]
    audience: str

def generate_description(credential_data: dict) -> CredentialDescription | None:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "system",
            "content": "Return a JSON with keys: short_description (str), highlights (list of strings), audience (str)"
        }, {
            "role": "user",
            "content": f"Generate a description for: {json.dumps(credential_data)}"
        }]
    )

    raw = response.choices[0].message.content
    try:
        data = json.loads(raw)
        return CredentialDescription(**data)
    except (json.JSONDecodeError, ValidationError) as e:
        logger.error("LLM output validation failed", extra={"raw": raw, "error": str(e)})
        return None  # Caller handles the None case explicitly
Use response_format: json_object

When available (GPT-4o, GPT-4 Turbo), this mode constrains the response to syntactically valid JSON — unless the output is cut off by the token limit, so check finish_reason too. It won't guarantee the structure matches your schema, but it eliminates the most common failure mode: the model wrapping JSON in markdown code blocks.
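For models without a JSON mode, a small fallback that strips markdown fences before parsing covers that common failure. A minimal sketch — extract_json is a hypothetical helper, not part of any SDK:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Parse raw LLM text, stripping a markdown code fence if present."""
    text = raw.strip()
    if text.startswith("```"):
        # Remove a leading ```json (or bare ```) fence and the trailing fence.
        text = re.sub(r"^```[a-zA-Z]*\s*", "", text)
        text = re.sub(r"\s*```$", "", text)
    return json.loads(text)
```

Run the result through Pydantic as above either way — valid JSON is not the same as the right schema.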

Lesson 2: Semantic Caching Cuts Costs Dramatically

LLM API calls are expensive. But many production workloads are semantically repetitive — the same credential type, the same user question phrased slightly differently. Exact-match caching helps for identical prompts. Semantic caching helps for similar prompts:

import json
import redis
import numpy as np

r = redis.Redis(decode_responses=True)

def semantic_cache_get(query: str, embedder, threshold=0.95) -> str | None:
    query_emb = np.array(embedder.embed(query))
    # SCAN iterates incrementally without blocking Redis; fine for small
    # caches, but use a vector index (e.g. RediSearch) at scale.
    for key in r.scan_iter("llm_cache:*"):
        cached = json.loads(r.get(key))
        cached_emb = np.array(cached['embedding'])
        similarity = np.dot(query_emb, cached_emb)  # assumes unit-length embeddings
        if similarity >= threshold:
            return cached['response']
    return None

def llm_with_semantic_cache(prompt: str, embedder, llm_fn) -> str:
    cached = semantic_cache_get(prompt, embedder)
    if cached:
        return cached
    response = llm_fn(prompt)
    emb = embedder.embed(prompt)
    r.setex(
        f"llm_cache:{hash(prompt)}",
        3600,
        json.dumps({'embedding': emb, 'response': response})
    )
    return response

With a 0.95 cosine similarity threshold on credential description generation, we achieved a 71% cache hit rate in the first month — cutting LLM costs by roughly $800/month at our request volume.
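One caveat: the dot product in the lookup equals cosine similarity only when embeddings are L2-normalized. Some providers return unit-length vectors, but not all do, so normalize before caching. A minimal sketch:

```python
import numpy as np

def normalize(vec: list[float]) -> np.ndarray:
    """L2-normalize so a plain dot product equals cosine similarity."""
    arr = np.asarray(vec, dtype=np.float64)
    norm = np.linalg.norm(arr)
    return arr / norm if norm > 0 else arr

# After normalization, dot(a, b) is exactly cos(theta) between the vectors:
a = normalize([3.0, 4.0])
b = normalize([6.0, 8.0])  # same direction, different magnitude
```

Skipping this silently deflates similarity scores and your hit rate with them.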

Lesson 3: Handle Rate Limits with Exponential Backoff

OpenAI, Anthropic, and every other LLM provider will rate-limit you. In production, you will hit these limits. Without retry logic, a rate-limit error becomes a user-visible failure. With exponential backoff, it's transparent:

import time
import random
from openai import RateLimitError, APIStatusError

def call_with_backoff(fn, max_retries=4):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)  # jitter
            logger.warning(f"Rate limited. Retrying in {wait:.1f}s (attempt {attempt+1})")
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code >= 500 and attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # server errors — retry
            else:
                raise  # client errors (4xx), or retries exhausted

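The schedule this produces is worth internalizing: waits of roughly 1-2s, 2-3s, 4-5s, then 8-9s, where the random jitter stops a fleet of clients from retrying in lockstep. A sketch that just computes the wait times:

```python
import random

def backoff_schedule(max_retries: int = 4) -> list[float]:
    """Wait times from 2**attempt plus full jitter in [0, 1)."""
    return [(2 ** attempt) + random.uniform(0, 1) for attempt in range(max_retries)]
```

Without jitter, every client that got rate-limited at the same moment retries at the same moment — and gets rate-limited again.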
Lesson 4: Context Window Management

Every LLM has a context window limit. If your input exceeds it, the call fails. More subtly, as context grows, cost and latency grow linearly — and model performance often degrades near the limit. Always count tokens before sending:

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # fallback for unknown models
    return len(enc.encode(text))

def truncate_to_budget(documents: list[str], budget_tokens: int = 3000, model: str = "gpt-4o") -> list[str]:
    selected, total = [], 0
    for doc in documents:
        tokens = count_tokens(doc, model)
        if total + tokens > budget_tokens:
            break
        selected.append(doc)
        total += tokens
    return selected
The "lost in the middle" problem

LLMs perform worse at recalling information from the middle of long contexts than from the beginning or end. When constructing RAG prompts, put the most important context either first or last — not buried in the middle of a 10-document context block.
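One simple mitigation, assuming your retriever returns documents sorted most-relevant-first: interleave them so the strongest candidates land at the edges of the prompt and the weakest sink to the middle. A sketch, not a benchmarked method:

```python
def order_for_recall(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant documents at the start and end of the
    context, pushing the least relevant toward the middle.

    Input is assumed sorted most-relevant-first (e.g. by retriever score).
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

For five documents ranked a > b > c > d > e, this yields a, c, e, d, b — the top two at the two positions the model recalls best.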

Lesson 5: Observability Is Not Optional

You cannot debug LLM applications without structured logs that capture: the full prompt, the raw response, the parsed output, latency, token usage, model version, and any validation failures. Without this, an 8% failure rate is invisible until it becomes an 80% failure rate.

import time
import structlog

logger = structlog.get_logger()

def logged_llm_call(prompt: str, **kwargs):
    start = time.time()
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        **kwargs
    )
    latency_ms = (time.time() - start) * 1000

    logger.info("llm_call_complete",
        model="gpt-4o",
        prompt=prompt,
        response=response.choices[0].message.content,
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        latency_ms=round(latency_ms, 1),
        finish_reason=response.choices[0].finish_reason,
    )
    return response

Key Takeaways

  • Always validate LLM output structure with Pydantic — never assume the format is correct
  • Semantic caching at 0.95 cosine similarity can achieve 60-70%+ cache hit rates on repetitive workloads
  • Exponential backoff with jitter is mandatory for handling rate limits transparently
  • Count tokens before sending — context window overflows are silent in naive implementations
  • Important context goes first or last in the prompt, not in the middle
  • Structured logging (prompt, response, tokens, latency) is the minimum viable observability

Written by Harshit Gupta

© 2026 Harshit Gupta · New Delhi, India