18 months, 20+ services, hundreds of enterprise customers, and more 3 AM incidents than I'd like to admit. The decisions that paid off: service isolation, async-first design, Redis caching from day one, and structured logging. The mistakes that cost us: over-engineering early, skipping idempotency, and building observability after the fires instead of before. Here's the full retrospective.
The Context
CertifyMe is a digital credential platform — we issue, verify, and manage digital certificates and badges for universities, training providers, and enterprise customers. When I joined, the backend was a Flask monolith serving a growing user base. As I write this, the backend is a fleet of 20+ microservices handling credential issuance, verification, LMS integrations, webhooks, analytics, and more — issuing tens of thousands of credentials per month across 100+ organizations.
This is a retrospective, not a success story. The successes are real, but so are the mistakes. Both have more to teach.
What Paid Off: Service Isolation
The early decision to isolate our public-facing credential verification service from the issuance backend paid the biggest dividends. Verification is our spikiest endpoint — QR codes get scanned by thousands of people at graduation ceremonies. Issuance is bursty but predictable.
Isolation meant: when our PDF generation service had a memory leak at 2 AM during a large batch issuance, it didn't take down verification. When a third-party LMS integration went rogue and hammered our API, the integration gateway rate-limited itself — the core issuance pipeline was unaffected.
Isolation also meant we could scale each service independently. Verification runs 6 replicas during peak season. The analytics service runs 1. Without isolation, we'd be scaling everything when anything needed more capacity.
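The self-rate-limiting behavior that saved the issuance pipeline amounts to a token bucket at the integration gateway. A minimal in-process sketch of the idea — the real gateway enforces limits per integration, and every name and number here is illustrative:

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: sustain `rate` calls/sec, absorb bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request instead of hammering the backend


bucket = TokenBucket(rate=5, capacity=10)  # hypothetical: 5 req/s, bursts of 10
allowed = [bucket.allow() for _ in range(15)]
# The initial burst of 10 passes; the remaining calls are shed until tokens refill.
```

The point of putting this in the gateway rather than the core services is exactly the isolation argument above: a misbehaving integration exhausts its own budget, not the issuance pipeline's.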
What Paid Off: Async-First Design
Making Celery the default for any operation over 500ms from the very first service was the best architectural decision we made. It forced a design discipline: HTTP handlers return fast, background workers do heavy lifting, webhooks notify when done.
This design is why our API response times are consistently under 200ms even for complex operations. Users never wait for PDF generation or blockchain anchoring. Support tickets about "the site is slow" are essentially nonexistent — because slowness happens invisibly in the background, not blocking the user's interface.
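In production this is Celery plus Redis, but the shape of the pattern fits in a few lines of stdlib Python: the handler enqueues work and returns a job ID immediately, a background worker does the slow part, and the client is notified (or polls) later. A toy sketch — the handler and payload names are illustrative, not our actual code:

```python
import queue
import threading
import uuid

jobs = queue.Queue()        # stand-in for the Celery broker
results: dict[str, str] = {}  # stand-in for the result backend


def handle_issue_request(payload: dict) -> dict:
    """HTTP handler: enqueue the slow work and return immediately."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    # Client gets a 202-style response; a webhook fires when the job completes.
    return {"status": "accepted", "job_id": job_id}


def worker() -> None:
    """Background worker: does the heavy lifting (PDF generation, anchoring, ...)."""
    while True:
        job_id, payload = jobs.get()
        results[job_id] = f"credential for {payload['user']}"  # stand-in for slow work
        jobs.task_done()


threading.Thread(target=worker, daemon=True).start()
resp = handle_issue_request({"user": "alice"})
jobs.join()  # only for this demo — the real client never waits
```

The 500ms rule is the forcing function: once anything slow must go through this path, handlers stay fast by construction rather than by heroics.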
What Paid Off: Structured Logging from Day One
We used structlog from the very first service and mandated that every log line include: service name, request ID, user ID, org ID, operation, and duration. This seemed like overhead in the early days. Three months in, when we were debugging an intermittent 0.5% issuance failure, structured logs let us filter to exactly those 17 failed requests and identify the common thread (a specific LMS integration format) in 20 minutes. Without structured logs, that investigation would have taken days.
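We use structlog, but the discipline matters more than the library — the same mandated fields can be approximated with nothing but the stdlib. A sketch (the field names match our mandate; the values and logger setup are illustrative):

```python
import json
import logging


class StructuredFormatter(logging.Formatter):
    """Emit one JSON object per log line with the fields mandated on every service."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "org_id": getattr(record, "org_id", None),
            "operation": getattr(record, "operation", record.funcName),
            "duration_ms": getattr(record, "duration_ms", None),
            "level": record.levelname,
            "message": record.getMessage(),
        })


logger = logging.getLogger("issuance")
handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("credential issued", extra={
    "service": "issuance", "request_id": "req-8f3a", "user_id": 42,
    "org_id": 7, "operation": "issue_credential", "duration_ms": 183,
})
```

With every line shaped like this, "filter to the 17 failed requests" is a one-line query in your log aggregator instead of an archaeology project.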
What Cost Us: Over-Engineering Early
Our biggest mistake was building a full event sourcing system for the first two services — capturing every state change as an immutable event, maintaining projections, the full CQRS pattern. It took six weeks to build, plus ongoing effort to maintain. It solved problems we didn't have. When we needed to ship the integration gateway fast, we didn't have six weeks. We'd already spent them on architectural sophistication that added zero business value at our scale.
The lesson I carry forward: build the simplest thing that will work at 10x your current scale. Not 100x. At 10x current scale, what will break? Fix that. Everything else is speculation.
"You Aren't Gonna Need It" hits differently when "it" is an extra microservice, an event-sourcing architecture, or a distributed cache layer. The operational complexity of each additional piece is real and ongoing. Build it when you need it. The cost of adding it later is almost always less than the cost of maintaining it when you don't need it yet.
What Cost Us: Skipping Idempotency
We didn't implement idempotency keys on credential issuance for the first four months. The argument was "we'll add it when we need it." We needed it on month two — a network timeout during a large batch issuance caused the client to retry, and 847 users received duplicate credentials. We spent two days on a cleanup script, and a week regaining the trust of that enterprise customer.
Idempotency for create operations is a 2-hour implementation. The incident it prevents is a 2-day recovery. The math is obvious in retrospect.
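The implementation really is that small. A minimal in-memory sketch of the pattern — our production store would be Redis-backed, and the key scheme here is made up:

```python
import threading


class IdempotencyStore:
    """First writer wins; retries with the same key get the cached result, not a duplicate."""

    def __init__(self):
        self._results: dict[str, dict] = {}
        self._lock = threading.Lock()

    def get_or_run(self, key: str, create):
        with self._lock:
            if key in self._results:
                # Retry path: return the original outcome, do NOT re-run the create.
                return self._results[key], False
            result = create()
            self._results[key] = result
            return result, True


store = IdempotencyStore()
issued = []


def issue_credential() -> dict:
    issued.append(1)  # stand-in for the real, non-idempotent issuance side effect
    return {"credential_id": len(issued)}


first, created = store.get_or_run("batch-42:user-847", issue_credential)
retry, created2 = store.get_or_run("batch-42:user-847", issue_credential)  # timeout retry
# retry == first, and only one credential was ever issued.
```

The client supplies the key (one per logical operation, e.g. per user per batch), so a network-timeout retry replays the same key and lands on the cached branch instead of issuing a duplicate.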
What Cost Us: Building Observability Reactively
We added distributed tracing after our first multi-service debugging nightmare. We added alerting after the verification service went down silently for 40 minutes. We added dashboards after a capacity incident. Every piece of observability infrastructure was built in response to a specific incident.
The right approach: build the minimum viable observability layer before the first service goes to production. That means: centralized logs with request IDs, one dashboard per service (error rate, p99 latency, queue depth), and alerts on SLA breaches. Not after you need it. Before.
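In practice those panels come from a metrics stack, but the computations behind them — error rate, p99 latency, an SLA-breach check — fit in a small class. A sketch to make the "minimum viable" bar concrete (the thresholds are illustrative, not our actual SLAs):

```python
from collections import deque


class ServiceStats:
    """Rolling window of request outcomes; enough to drive a per-service dashboard and alert."""

    def __init__(self, window: int = 1000):
        self.latencies_ms = deque(maxlen=window)
        self.errors = deque(maxlen=window)

    def record(self, latency_ms: float, error: bool) -> None:
        self.latencies_ms.append(latency_ms)
        self.errors.append(error)

    def p99_ms(self) -> float:
        ordered = sorted(self.latencies_ms)
        return ordered[int(0.99 * (len(ordered) - 1))]

    def error_rate(self) -> float:
        return sum(self.errors) / len(self.errors)

    def breaches_sla(self, p99_budget_ms: float = 200, error_budget: float = 0.01) -> bool:
        # Page when either budget is blown; silence is the failure mode to design against.
        return self.p99_ms() > p99_budget_ms or self.error_rate() > error_budget


stats = ServiceStats()
for i in range(100):
    stats.record(latency_ms=50 + i, error=(i % 50 == 0))  # 2% errors: over budget
# stats.breaches_sla() is True — the 2% error rate exceeds the 1% budget.
```

A 40-minute silent outage is exactly what the `breaches_sla` alert exists to prevent: the point is not the sophistication of the check but that it runs before the first incident, not after.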
What Cost Us: Shared Database Between Two Services
Early on, two services shared a MySQL database "temporarily, just for now." That "temporary" configuration lasted eleven months. During a schema migration for one service, it broke the other. We had to coordinate deployments between two services that were supposed to be independent. The migration took a full day of carefully orchestrated steps instead of a standard deploy.
The rule is simple and absolute: one service, one database. The day you break it, you start accumulating coupling debt that accrues interest until you finally pay it back — usually at the worst possible time.
The Stack in Retrospect
If I were starting fresh today with the same requirements:
- Python + Flask — still the right choice for our team's expertise and iteration speed
- Celery + Redis — battle-tested, exactly right for our async patterns
- MySQL — solid for our relational data model; I'd evaluate PostgreSQL more seriously today for its richer feature set (JSONB, full-text search, a broader set of index types)
- Docker + ECS Fargate — right choice, though I'd invest in proper infrastructure-as-code (Terraform) from day one instead of clicking through the console
- OpenTelemetry — I'd mandate this from service one. Plugging in traces retroactively is painful
Kubernetes felt exciting when we evaluated it. ECS Fargate felt boring. We chose Fargate and never looked back — zero container orchestration incidents in 18 months, ops overhead close to zero. The boring choice is often boring because it solves the problem without new problems. In production infrastructure, boring is underrated.
Key Takeaways
- Service isolation and async-first design provide the highest architectural ROI
- Structured logging from day one transforms multi-day debugging into 20-minute investigations
- Over-engineering at early scale is more expensive than under-engineering — build for 10x, not 100x
- Idempotency for write operations is a 2-hour investment that prevents 2-day incidents
- Build observability (logs, dashboards, alerts) before the first service ships, not after the first incident
- One service, one database — the shared database exception always becomes a shared-state nightmare