Microservices aren't magic — they're a trade-off. This post covers 5 hard-won principles from building 20+ production services: right service boundaries, Flask Blueprint structure, async task queues, inter-service communication, and centralized auth. Skip to any principle that's relevant to you.
The Monolith That Couldn't Keep Up
Picture this: it's 3 AM, and your on-call engineer is knee-deep in a failed deployment that's taking down the entire platform: a single credential-rendering bug has dragged the notification system, the API gateway, and the user dashboard down with it. That was us at CertifyMe — before we made the switch.
When I joined, the backend was a growing monolith. As the platform scaled to support enterprise credential issuance for tens of thousands of users across 100+ organizations, it became clear that a microservices architecture wasn't just preferable — it was necessary for survival. Over the following year, I led the design and implementation of 20+ production microservices. Here's everything I wish I'd known before we started.
Microservices add real operational complexity. Only make the move if you have clear pain points: deployment coupling, team velocity bottlenecks, or scaling needs that a monolith genuinely can't solve. Don't do it for the architecture diagram.
Principle 1: Define Service Boundaries by Business Capability
The most common (and painful) mistake is splitting services by technical layer — "a database service", "a validation service", "a utils service". This creates distributed coupling that's worse than a monolith.
Instead, split by business capability. Ask: what does this service own? At CertifyMe, our service map looked like this:
- Credential Issuance Service — the full issuance lifecycle
- Verification Service — public-facing credential verification & blockchain anchoring
- Notification Service — email, webhook, and in-app delivery
- Integration Gateway — third-party LMS & HR system connections
Each service owns its data, its logic, and its API. Zero shared databases. This is the hardest rule to enforce but the most important one — the moment two services share a table, you've just built a distributed monolith.
Principle 2: Use Flask Blueprints for Internal Structure
Flask is deliberately minimal. That's a feature, not a bug — but it means structure is entirely your responsibility. We found Blueprints invaluable for keeping each service clean and testable:
```python
from flask import Blueprint, Flask

credential_bp = Blueprint('credentials', __name__, url_prefix='/credentials')

@credential_bp.route('/issue', methods=['POST'])
def issue_credential():
    # Single responsibility: orchestrate issuance
    pass

# Register in the app factory — never globally
def create_app():
    app = Flask(__name__)
    app.register_blueprint(credential_bp)
    return app
```
Each Blueprint maps to a bounded context within the service. The app factory pattern makes testing individual blueprints trivial — just create the app in test mode and hit the routes directly.
Keep your route handlers thin. They should do one thing: validate input, call a service layer function, return the response. All business logic lives in the service layer, not in route handlers. This makes unit testing your logic effortless.
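Here's what that separation looks like in practice. This is a minimal sketch: `issue_credential_for` is a hypothetical service-layer function standing in for the real issuance logic.

```python
from flask import Blueprint, Flask, jsonify, request

credential_bp = Blueprint('credentials', __name__, url_prefix='/credentials')

# --- service layer: all business logic lives here ---
def issue_credential_for(recipient_email):
    # the real implementation would create records, enqueue tasks, etc.
    return {'credential_id': 'cred_123', 'recipient': recipient_email}

# --- route handler: validate input, delegate, return the response ---
@credential_bp.route('/issue', methods=['POST'])
def issue_credential():
    data = request.get_json(silent=True) or {}
    if 'recipient_email' not in data:
        return jsonify({'error': 'recipient_email is required'}), 400
    result = issue_credential_for(data['recipient_email'])
    return jsonify(result), 201

def create_app():
    app = Flask(__name__)
    app.register_blueprint(credential_bp)
    return app
```

The handler never touches business rules, so you can unit-test `issue_credential_for` with no HTTP machinery at all.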
Principle 3: Async Task Queues for Heavy Work
Credential issuance involves PDF generation, blockchain anchoring, and email delivery. These operations take 2–8 seconds each. None of them should block an HTTP response — your users will think the site is broken.
We solved this with Celery backed by Redis as the message broker:
```python
from celery import Celery

celery = Celery('tasks', broker='redis://localhost:6379/0')

@celery.task(bind=True, max_retries=3)
def issue_credential_async(self, credential_id):
    try:
        # generate_pdf, anchor_on_blockchain, send_email are internal helpers
        pdf = generate_pdf(credential_id)
        anchor_on_blockchain(credential_id, pdf.hash)
        send_email(credential_id, pdf.url)
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)  # retry in 60s
This kept API response times under 200ms even for the most complex issuance workflows. The user gets an immediate "processing" response, and the heavy work happens in the background with automatic retries on failure.
Principle 4: Standardize Inter-Service Communication
We used two patterns: REST for synchronous calls (when you need an immediate answer) and Redis pub/sub for event-driven flows (when you just need to notify).
The non-negotiable rule: never call another service's database directly. Always go through its API. This is the hardest discipline to maintain, especially under deadline pressure. But the day you break it is the day you start accumulating hidden coupling that will bite you six months later during a schema migration.
Synchronous REST for queries that need immediate results. Redis pub/sub events for state changes that other services react to. Example: when a credential is issued, the Issuance Service publishes a credential.issued event — the Notification Service and Analytics Service both listen and react independently. Neither service needs to know the other exists.
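The publish side of that flow can be sketched as follows. The channel name and payload shape are illustrative, and `conn` is assumed to be a redis-py `Redis` client (any object with `publish` and `pubsub` works).

```python
import json

EVENT_CHANNEL = 'events.credential.issued'

def make_event(credential_id, recipient_email):
    # a versionable, JSON-serialized event payload
    return json.dumps({'type': 'credential.issued',
                       'credential_id': credential_id,
                       'recipient': recipient_email})

def publish_issued(conn, credential_id, recipient_email):
    # fire-and-forget: the publisher never knows who is listening
    conn.publish(EVENT_CHANNEL, make_event(credential_id, recipient_email))

def listen_for_issued(conn, handle):
    # each subscriber (Notification, Analytics, ...) runs its own loop
    pubsub = conn.pubsub()
    pubsub.subscribe(EVENT_CHANNEL)
    for message in pubsub.listen():
        if message['type'] == 'message':
            handle(json.loads(message['data']))

# conn = redis.Redis(host='localhost', port=6379, db=0)
```

Because subscribers are independent, adding a new consumer of `credential.issued` requires zero changes to the Issuance Service.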
Principle 5: Centralized Auth, Distributed Enforcement
We implemented OAuth2 at the API gateway level and passed JWT tokens downstream. Each service validated the token locally using a shared secret — no round-trips to an auth service on every request. This pattern keeps latency low while maintaining security:
```python
import jwt
from functools import wraps
from flask import request, g, abort

# SECRET_KEY is the shared signing secret, loaded from the service's config

def require_auth(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        token = request.headers.get('Authorization', '').replace('Bearer ', '')
        if not token:
            abort(401)
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
            g.user = payload
        except jwt.ExpiredSignatureError:
            abort(401, 'Token expired')
        except jwt.InvalidTokenError:
            abort(401, 'Invalid token')
        return f(*args, **kwargs)
    return decorated
```
We also implemented AES-256 encryption for credential payloads in transit and at rest — a compliance requirement for enterprise customers that's much easier to add at the service level than to retrofit into a monolith.
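At the payload level, that encryption can be sketched with AES-256-GCM via the widely used `cryptography` package (the function names are ours; key rotation and storage in a secrets manager are out of scope here):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_payload(key, plaintext):
    # 96-bit nonce is the recommended size for GCM; prepend it to the blob
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

def decrypt_payload(key, blob):
    nonce, ciphertext = blob[:12], blob[12:]
    # raises InvalidTag if the ciphertext was tampered with
    return AESGCM(key).decrypt(nonce, ciphertext, None)

# key = AESGCM.generate_key(bit_length=256)  # keep in a secrets manager
```

GCM gives you authenticated encryption, so tampering is detected at decrypt time rather than silently producing garbage.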
Operational Realities Nobody Talks About
The five principles above will get you a working microservices architecture. These lessons will save you from 3 AM incidents:
Add distributed tracing on day one. When a request touches 4 services and something goes wrong, you need a correlation ID that follows the request through every log. We added this retroactively and it was painful. Use OpenTelemetry from the start.
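Even before full OpenTelemetry, a minimal correlation-ID middleware goes a long way. This sketch reuses an incoming `X-Request-ID` header or mints one, and echoes it on every response so logs across services can be joined on the same id:

```python
import uuid
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def assign_correlation_id():
    # reuse the caller's id if present, otherwise mint a fresh one
    g.correlation_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))

@app.after_request
def echo_correlation_id(response):
    response.headers['X-Request-ID'] = g.correlation_id
    return response

@app.route('/ping')
def ping():
    # outbound calls to other services should forward g.correlation_id
    return 'pong'
```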
Idempotency keys are not optional. Networks fail. Clients retry. Without idempotency keys on your write operations, you will issue duplicate credentials. We learned this the hard way.
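The pattern is simple to sketch. Here the store is an in-memory dict for illustration; in production you'd use Redis with a TTL so retries within the window replay the original result instead of issuing twice:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
_seen = {}  # idempotency key -> cached response body (use Redis + TTL in prod)

@app.route('/credentials/issue', methods=['POST'])
def issue_credential():
    key = request.headers.get('Idempotency-Key')
    if not key:
        return jsonify({'error': 'Idempotency-Key header required'}), 400
    if key in _seen:
        # a retry of a request we already processed: replay, don't re-issue
        return jsonify(_seen[key]), 200
    result = {'credential_id': f'cred_{len(_seen) + 1}'}  # illustrative
    _seen[key] = result
    return jsonify(result), 201
```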
Health check endpoints on every service. A simple /health endpoint returning 200 saves hours during incidents. Your load balancer, your monitoring, and your sanity will thank you.
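A health endpoint really is this small. Real checks might also ping the database and broker, but keep it fast: load balancers call it constantly.

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    # liveness only; add dependency checks behind a separate /ready if needed
    return jsonify({'status': 'ok'}), 200
```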
Key Takeaways
- Split by business capability, never by technical layer
- No shared databases — ever. Own your data or don't have it
- Celery + Redis for anything that takes > 500ms
- REST for synchronous queries, pub/sub for events
- JWT local validation — no auth service round-trips on every request
- Distributed tracing, idempotency keys, and health checks from day one