Microservices aren't magic — they're a trade-off. This post covers 5 hard-won principles from building 20+ production services: right service boundaries, Flask Blueprint structure, async task queues, inter-service communication, and centralized auth. Skip to any principle that's relevant to you.
The Monolith That Couldn't Keep Up
Picture this: it's 3 AM, and your on-call engineer is knee-deep in a failed deployment that's taking down the entire platform: a single credential-rendering bug has dragged the notification system, the API gateway, and the user dashboard down with it. That was us at CertifyMe — before we made the switch.
When I joined, the backend was a growing monolith. As the platform scaled to support enterprise credential issuance for tens of thousands of users across 100+ organizations, it became clear that a microservices architecture wasn't just preferable — it was necessary for survival. Over the following year, I led the design and implementation of 20+ production microservices. Here's everything I wish I'd known before we started.
Microservices add real operational complexity. Only make the move if you have clear pain points: deployment coupling, team velocity bottlenecks, or scaling needs that a monolith genuinely can't solve. Don't do it for the architecture diagram.
Principle 1: Define Service Boundaries by Business Capability
The most common (and painful) mistake is splitting services by technical layer — "a database service", "a validation service", "a utils service". This creates distributed coupling that's worse than a monolith.
Instead, split by business capability. Ask: what does this service own? At CertifyMe, our service map looked like this:
- Credential Issuance Service — the full issuance lifecycle
- Verification Service — public-facing credential verification & blockchain anchoring
- Notification Service — email, webhook, and in-app delivery
- Integration Gateway — third-party LMS & HR system connections
Each service owns its data, its logic, and its API. Zero shared databases. This is the hardest rule to enforce but the most important one — the moment two services share a table, you've just built a distributed monolith.
Principle 2: Use Flask Blueprints for Internal Structure
Flask is deliberately minimal. That's a feature, not a bug — but it means structure is entirely your responsibility. We found Blueprints invaluable for keeping each service clean and testable:
```python
from flask import Blueprint, Flask

credential_bp = Blueprint('credentials', __name__, url_prefix='/credentials')

@credential_bp.route('/issue', methods=['POST'])
def issue_credential():
    # Single responsibility: orchestrate issuance
    pass

# Register in the app factory — never globally
def create_app():
    app = Flask(__name__)
    app.register_blueprint(credential_bp)
    return app
```
Each Blueprint maps to a bounded context within the service. The app factory pattern makes testing individual blueprints trivial — just create the app in test mode and hit the routes directly.
Keep your route handlers thin. They should do one thing: validate input, call a service layer function, return the response. All business logic lives in the service layer, not in route handlers. This makes unit testing your logic effortless.
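Here's what that separation looks like in practice. This is a minimal sketch: `issue_credential_for` is a hypothetical service-layer function standing in for the real issuance logic.

```python
from flask import Blueprint, Flask, jsonify, request

credential_bp = Blueprint('credentials', __name__, url_prefix='/credentials')

# --- service layer: all business logic lives here ---
def issue_credential_for(recipient_email):
    # the real implementation would create records, enqueue tasks, etc.
    return {'credential_id': 'cred_123', 'recipient': recipient_email}

# --- route handler: validate input, delegate, return the response ---
@credential_bp.route('/issue', methods=['POST'])
def issue_credential():
    data = request.get_json(silent=True) or {}
    if 'recipient_email' not in data:
        return jsonify({'error': 'recipient_email is required'}), 400
    result = issue_credential_for(data['recipient_email'])
    return jsonify(result), 201

def create_app():
    app = Flask(__name__)
    app.register_blueprint(credential_bp)
    return app
```

The handler never touches business rules, so you can unit-test `issue_credential_for` with no HTTP machinery at all.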
Principle 3: Async Task Queues for Heavy Work
Credential issuance involves PDF generation, blockchain anchoring, and email delivery. These operations take 2–8 seconds each. None of them should block an HTTP response — your users will think the site is broken.
We solved this with Celery backed by Redis as the message broker:
```python
from celery import Celery

celery = Celery('tasks', broker='redis://localhost:6379/0')

@celery.task(bind=True, max_retries=3)
def issue_credential_async(self, credential_id):
    try:
        # generate_pdf, anchor_on_blockchain, send_email are internal helpers
        pdf = generate_pdf(credential_id)
        anchor_on_blockchain(credential_id, pdf.hash)
        send_email(credential_id, pdf.url)
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)  # retry in 60s
This kept API response times under 200ms even for the most complex issuance workflows. The user gets an immediate "processing" response, and the heavy work happens in the background with automatic retries on failure.
Principle 4: Standardize Inter-Service Communication
We used two patterns: REST for synchronous calls (when you need an immediate answer) and Redis pub/sub for event-driven flows (when you just need to notify).
The non-negotiable rule: never call another service's database directly. Always go through its API. This is the hardest discipline to maintain, especially under deadline pressure. But the day you break it is the day you start accumulating hidden coupling that will bite you six months later during a schema migration.
Synchronous REST for queries that need immediate results. Redis pub/sub events for state changes that other services react to. Example: when a credential is issued, the Issuance Service publishes a credential.issued event — the Notification Service and Analytics Service both listen and react independently. Neither service needs to know the other exists.
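The publish side of that flow can be sketched as follows. The channel name and payload shape are illustrative, and `conn` is assumed to be a redis-py `Redis` client (any object with `publish` and `pubsub` works).

```python
import json

EVENT_CHANNEL = 'events.credential.issued'

def make_event(credential_id, recipient_email):
    # a versionable, JSON-serialized event payload
    return json.dumps({'type': 'credential.issued',
                       'credential_id': credential_id,
                       'recipient': recipient_email})

def publish_issued(conn, credential_id, recipient_email):
    # fire-and-forget: the publisher never knows who is listening
    conn.publish(EVENT_CHANNEL, make_event(credential_id, recipient_email))

def listen_for_issued(conn, handle):
    # each subscriber (Notification, Analytics, ...) runs its own loop
    pubsub = conn.pubsub()
    pubsub.subscribe(EVENT_CHANNEL)
    for message in pubsub.listen():
        if message['type'] == 'message':
            handle(json.loads(message['data']))

# conn = redis.Redis(host='localhost', port=6379, db=0)
```

Because subscribers are independent, adding a new consumer of `credential.issued` requires zero changes to the Issuance Service.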
Principle 5: Centralized Auth, Distributed Enforcement
We implemented OAuth2 at the API gateway level and passed JWT tokens downstream. Each service validated the token locally using a shared secret — no round-trips to an auth service on every request. This pattern keeps latency low while maintaining security:
```python
import jwt
from functools import wraps
from flask import request, g, abort

# SECRET_KEY is the shared signing secret, loaded from the service's config

def require_auth(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        token = request.headers.get('Authorization', '').replace('Bearer ', '')
        if not token:
            abort(401)
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
            g.user = payload
        except jwt.ExpiredSignatureError:
            abort(401, 'Token expired')
        except jwt.InvalidTokenError:
            abort(401, 'Invalid token')
        return f(*args, **kwargs)
    return decorated
```
We also implemented AES-256 encryption for credential payloads in transit and at rest — a compliance requirement for enterprise customers that's much easier to add at the service level than to retrofit into a monolith.
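At the payload level, that encryption can be sketched with AES-256-GCM via the widely used `cryptography` package (the function names are ours; key rotation and storage in a secrets manager are out of scope here):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_payload(key, plaintext):
    # 96-bit nonce is the recommended size for GCM; prepend it to the blob
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

def decrypt_payload(key, blob):
    nonce, ciphertext = blob[:12], blob[12:]
    # raises InvalidTag if the ciphertext was tampered with
    return AESGCM(key).decrypt(nonce, ciphertext, None)

# key = AESGCM.generate_key(bit_length=256)  # keep in a secrets manager
```

GCM gives you authenticated encryption, so tampering is detected at decrypt time rather than silently producing garbage.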
Operational Realities Nobody Talks About
The five principles above will get you a working microservices architecture. These lessons will save you from 3 AM incidents:
Add distributed tracing on day one. When a request touches 4 services and something goes wrong, you need a correlation ID that follows the request through every log. We added this retroactively and it was painful. Use OpenTelemetry from the start.
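Even before full OpenTelemetry, a minimal correlation-ID middleware goes a long way. This sketch reuses an incoming `X-Request-ID` header or mints one, and echoes it on every response so logs across services can be joined on the same id:

```python
import uuid
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def assign_correlation_id():
    # reuse the caller's id if present, otherwise mint a fresh one
    g.correlation_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))

@app.after_request
def echo_correlation_id(response):
    response.headers['X-Request-ID'] = g.correlation_id
    return response

@app.route('/ping')
def ping():
    # outbound calls to other services should forward g.correlation_id
    return 'pong'
```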
Idempotency keys are not optional. Networks fail. Clients retry. Without idempotency keys on your write operations, you will issue duplicate credentials. We learned this the hard way.
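The pattern is simple to sketch. Here the store is an in-memory dict for illustration; in production you'd use Redis with a TTL so retries within the window replay the original result instead of issuing twice:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
_seen = {}  # idempotency key -> cached response body (use Redis + TTL in prod)

@app.route('/credentials/issue', methods=['POST'])
def issue_credential():
    key = request.headers.get('Idempotency-Key')
    if not key:
        return jsonify({'error': 'Idempotency-Key header required'}), 400
    if key in _seen:
        # a retry of a request we already processed: replay, don't re-issue
        return jsonify(_seen[key]), 200
    result = {'credential_id': f'cred_{len(_seen) + 1}'}  # illustrative
    _seen[key] = result
    return jsonify(result), 201
```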
Health check endpoints on every service. A simple /health endpoint returning 200 saves hours during incidents. Your load balancer, your monitoring, and your sanity will thank you.
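A health endpoint really is this small. Real checks might also ping the database and broker, but keep it fast: load balancers call it constantly.

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    # liveness only; add dependency checks behind a separate /ready if needed
    return jsonify({'status': 'ok'}), 200
```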
Key Takeaways
- Split by business capability, never by technical layer
- No shared databases — ever. Own your data or don't have it
- Celery + Redis for anything that takes > 500ms
- REST for synchronous queries, pub/sub for events
- JWT local validation — no auth service round-trips on every request
- Distributed tracing, idempotency keys, and health checks from day one