LLMs are probabilistic systems running on top of deterministic infrastructure. The mistakes are different. Key lessons: always validate LLM output structure (don't trust JSON), cache aggressively (same prompt = same response 95% of the time), handle rate limits and context window limits gracefully, and build observability from day one — you cannot debug what you cannot see.
The Illusion of Simplicity
The first LLM feature I shipped at CertifyMe took two days: an AI-powered credential description generator. The demo was flawless. The week after launch, it started silently returning empty descriptions for 8% of users. Support tickets piled up. We had no idea why.
The cause? The LLM was occasionally returning a JSON response with the description field spelled descripton (a typo in its output). Our code did response['description'], got a KeyError, the exception handler swallowed it silently, and users got a blank field. Classic. But classic in a way that normal software bugs aren't — because it was non-deterministic, happened 8% of the time, and was invisible without structured logging.
That incident taught me the most important lesson in LLM engineering: LLMs are probabilistic. Your system around them must be defensive.
Lesson 1: Validate LLM Output Structure — Always
Never trust that an LLM will return the exact format you asked for. Even with explicit JSON instructions, models occasionally produce prose, markdown-wrapped JSON, or structurally valid JSON with missing fields. Use Pydantic for structural validation:
import json
import logging

from openai import OpenAI
from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)

class CredentialDescription(BaseModel):
    short_description: str
    highlights: list[str]
    audience: str

def generate_description(credential_data: dict) -> CredentialDescription | None:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Return a JSON object with keys: short_description (str), "
                    "highlights (list of strings), audience (str)"
                ),
            },
            {
                "role": "user",
                "content": f"Generate a description for: {json.dumps(credential_data)}",
            },
        ],
    )
    raw = response.choices[0].message.content
    try:
        data = json.loads(raw)
        return CredentialDescription(**data)
    except (json.JSONDecodeError, ValidationError) as e:
        logger.error("LLM output validation failed", extra={"raw": raw, "error": str(e)})
        return None  # Caller handles the None case explicitly
response_format: json_object
When available (GPT-4o, GPT-4 Turbo), this mode guarantees syntactically valid JSON, barring truncation when the response hits the max_tokens limit. It won't guarantee the structure matches your schema, but it eliminates the most common failure mode: the model wrapping JSON in markdown code blocks.
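For models that don't support JSON mode, a small pre-parser that strips markdown fences is a pragmatic fallback. This is a minimal sketch; `extract_json` is a hypothetical helper name, not part of any SDK:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Parse LLM output that may be wrapped in ```json ... ``` fences."""
    # If the model wrapped its JSON in a markdown code block, unwrap it first
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    candidate = match.group(1) if match else raw
    return json.loads(candidate)
```

Run the result through Pydantic afterwards exactly as above; unwrapping the fences fixes the presentation, not the schema.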
Lesson 2: Semantic Caching Cuts Costs Dramatically
LLM API calls are expensive. But many production workloads are semantically repetitive — the same credential type, the same user question phrased slightly differently. Exact-match caching helps for identical prompts. Semantic caching helps for similar prompts:
import hashlib
import json

import numpy as np
import redis

r = redis.Redis()

def _normalize(vec: np.ndarray) -> np.ndarray:
    return vec / np.linalg.norm(vec)

def semantic_cache_get(query: str, embedder, threshold: float = 0.95) -> str | None:
    query_emb = _normalize(np.array(embedder.embed(query)))
    # Linear scan over cached entries — fine at small scale; use a
    # vector index (e.g. Redis Search) once the cache grows
    for key in r.scan_iter("llm_cache:*"):
        cached = json.loads(r.get(key))
        cached_emb = _normalize(np.array(cached["embedding"]))
        similarity = float(np.dot(query_emb, cached_emb))  # cosine similarity on unit vectors
        if similarity >= threshold:
            return cached["response"]
    return None

def llm_with_semantic_cache(prompt: str, embedder, llm_fn) -> str:
    cached = semantic_cache_get(prompt, embedder)
    if cached:
        return cached
    response = llm_fn(prompt)
    emb = embedder.embed(prompt)
    key = hashlib.sha256(prompt.encode()).hexdigest()  # stable key; built-in hash() varies per process
    r.setex(
        f"llm_cache:{key}",
        3600,  # 1-hour TTL
        json.dumps({"embedding": list(emb), "response": response}),
    )
    return response
With a 0.95 cosine similarity threshold on credential description generation, we achieved a 71% cache hit rate in the first month — cutting LLM costs by roughly $800/month at our request volume.
Lesson 3: Handle Rate Limits with Exponential Backoff
OpenAI, Anthropic, and every other LLM provider will rate-limit you. In production, you will hit these limits. Without retry logic, a rate-limit error becomes a user-visible failure. With exponential backoff, it's transparent:
import logging
import random
import time

from openai import APIStatusError, RateLimitError

logger = logging.getLogger(__name__)

def call_with_backoff(fn, max_retries=4):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)  # jitter avoids synchronized retries
            logger.warning(f"Rate limited. Retrying in {wait:.1f}s (attempt {attempt + 1})")
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code >= 500 and attempt < max_retries - 1:  # server errors: retry
                time.sleep(2 ** attempt)
            else:  # client errors (4xx), or retries exhausted: don't retry
                raise
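The pattern is easy to exercise offline against a stub that fails a fixed number of times before succeeding. Everything below is illustrative (no real API involved); the exception type is a parameter so the loop can be tested without an OpenAI client:

```python
import random
import time

def retry_with_backoff(fn, max_retries=4, retryable=(ConnectionError,)):
    """Generic version of the retry loop: exception types are a parameter,
    so the backoff logic can be tested without a live LLM endpoint."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise
            # Scaled-down backoff so the demo runs in milliseconds
            time.sleep((2 ** attempt) * 0.01 + random.uniform(0, 0.01))

calls = {"count": 0}

def flaky_llm():
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient upstream hiccup")
    return "generated text"
```

`retry_with_backoff(flaky_llm)` absorbs the first two failures and returns the stub's response on the third attempt.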
Lesson 4: Context Window Management
Every LLM has a context window limit. If your input exceeds it, the call fails. More subtly, as context grows, cost and latency grow linearly — and model performance often degrades near the limit. Always count tokens before sending:
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def truncate_to_budget(
    documents: list[str], budget_tokens: int = 3000, model: str = "gpt-4o"
) -> list[str]:
    selected, total = [], 0
    for doc in documents:
        tokens = count_tokens(doc, model)
        if total + tokens > budget_tokens:
            break
        selected.append(doc)
        total += tokens
    return selected
LLMs recall information from the middle of long contexts worse than from the beginning or end (the "lost in the middle" effect). When constructing RAG prompts, put the most important context first or last, not buried in the middle of a 10-document context block.
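One way to apply this: take your retriever's best-first ranking and interleave it so the strongest documents sit at both ends of the context, pushing the weakest toward the middle. A sketch (`order_for_recall` is an illustrative name; input is assumed sorted best-first):

```python
def order_for_recall(ranked_docs: list[str]) -> list[str]:
    """Place the highest-ranked documents at the start and end of the
    context, weakest in the middle ("lost in the middle" mitigation).
    Input must be sorted best-first."""
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        # Alternate: even ranks fill from the front, odd ranks from the back
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

For ranks 1..5 this yields the order 1, 3, 5, 4, 2: the top two documents end up first and last.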
Lesson 5: Observability Is Not Optional
You cannot debug LLM applications without structured logs that capture: the full prompt, the raw response, the parsed output, latency, token usage, model version, and any validation failures. Without this, an 8% failure rate is invisible until it becomes an 80% failure rate.
import time

import structlog
from openai import OpenAI

logger = structlog.get_logger()
openai_client = OpenAI()

def logged_llm_call(prompt: str, **kwargs):
    start = time.time()
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    latency_ms = (time.time() - start) * 1000
    logger.info(
        "llm_call_complete",
        model="gpt-4o",
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        latency_ms=round(latency_ms, 1),
        finish_reason=response.choices[0].finish_reason,
    )
    return response
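Once those fields are flowing, the 8%-silent-failure scenario from the intro becomes a three-line metric. A sketch assuming log events have been aggregated into dicts and that a `validation_failed` field is set whenever parsing fails (both names are illustrative, not part of structlog):

```python
def failure_rate(events: list[dict]) -> float:
    """Fraction of logged LLM calls whose output failed validation."""
    if not events:
        return 0.0
    failures = sum(1 for e in events if e.get("validation_failed"))
    return failures / len(events)
```

Alert on this number. A validation failure rate that creeps from 0% to 8% is exactly the kind of drift that never shows up in exception trackers, because the exceptions were caught.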
Key Takeaways
- Always validate LLM output structure with Pydantic — never assume the format is correct
- Semantic caching at 0.95 cosine similarity can achieve 60-70%+ cache hit rates on repetitive workloads
- Exponential backoff with jitter is mandatory for handling rate limits transparently
- Count tokens before sending — context window overflows are silent in naive implementations
- Important context goes first or last in the prompt, not in the middle
- Structured logging (prompt, response, tokens, latency) is the minimum viable observability