The gap between "it works in Docker" and "it works well in production Docker" is enormous. Key practices: multi-stage builds (cut image size 70%), non-root user, proper signal handling with tini, explicit dependency pinning, health checks, and environment-based secrets — never baked-in secrets. This post walks through each with real Dockerfiles.
The "Works on My Machine" Problem at Scale
The promise of Docker is environment parity. And it delivers — until you deploy 20+ services to production and realize your images are 1.2GB each, your containers crash silently on SIGTERM, your secrets are baked into image layers visible to anyone with registry access, and a single failing container brings down the host because nobody set memory limits.
These aren't hypothetical problems. They're the checklist of issues we debugged at CertifyMe over 18 months of running Python microservices in Docker on AWS ECS. Here's the complete production-grade setup we converged on.
Practice 1: Multi-Stage Builds
A naive Python Dockerfile copies in requirements.txt, installs dependencies, and copies your code. The result: an image that includes build tools, compiler toolchains, and pip's cache — none of which you need at runtime. Our naive images were 1.4GB. Multi-stage builds cut them to under 400MB:
# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
# Install build deps, compile wheels, then throw away build deps
RUN pip install --upgrade pip && \
    pip wheel --no-cache-dir --no-deps --wheel-dir /wheels -r requirements.txt
# Stage 2: Runtime image
FROM python:3.11-slim AS runtime
# Install tini for proper signal handling
RUN apt-get update && apt-get install -y --no-install-recommends tini && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
# Create non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser
WORKDIR /app
# Copy only pre-built wheels from builder stage
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*
COPY --chown=appuser:appuser . .
USER appuser
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "--workers", "4", "app:create_app()"]
Practice 2: Run as Non-Root
By default, Docker containers run as root. This means a container escape vulnerability gives an attacker root access to your host. The fix is two lines — create a dedicated user and switch to it before the CMD. Already shown above, but it's worth calling out explicitly because it's skipped far too often.
Run docker inspect <container> --format '{{.Config.User}}'. If it prints nothing, you're running as root. This is a common finding in security audits and one of the easiest to fix.
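If you want to audit a whole fleet rather than one container, the same check is easy to script against docker inspect's JSON output. A minimal sketch — the runs_as_root helper and the inline fixture are illustrative, not part of any tool:

```python
import json

def runs_as_root(inspect_output: str) -> bool:
    """Given `docker inspect <container>` JSON output, report whether the
    container is configured to run as root (empty User, "root", or uid 0)."""
    config = json.loads(inspect_output)[0].get("Config", {})
    user = config.get("User", "") or ""
    # User may be "name", "uid", or "uid:gid"; only the user part matters here
    return user.split(":")[0] in ("", "root", "0")

# Trimmed inspect fixture for a container with no User set (hypothetical):
fixture = '[{"Config": {"User": ""}}]'
print(runs_as_root(fixture))  # -> True: this container runs as root
```

Feeding it the output of docker inspect for each running container turns the audit into a one-line loop in CI.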
Practice 3: Proper Signal Handling with tini
When Docker stops a container, it sends SIGTERM to PID 1. But PID 1 gets no default signal dispositions from the kernel, so if your app runs as PID 1 and doesn't explicitly handle SIGTERM, the signal is simply ignored; Docker waits 10 seconds (the default grace period) then sends SIGKILL — a hard kill with no graceful shutdown. In-flight requests get dropped. Celery tasks get interrupted mid-execution.
tini is a minimal init process designed for containers. It properly handles signal forwarding and zombie process reaping. Using it as your ENTRYPOINT means your Python process gets signals correctly and has time to finish in-flight work before shutdown.
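Gunicorn already traps SIGTERM and shuts down gracefully, but a plain Python worker (a queue consumer, a cron-style loop) needs to opt in. A minimal sketch of the pattern, with the SIGTERM sent to ourselves to simulate docker stop:

```python
import os
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    """Flag shutdown so the worker loop can drain in-flight work, then exit."""
    global shutting_down
    shutting_down = True

# Register the handler; tini forwards the SIGTERM Docker sends on `docker stop`
signal.signal(signal.SIGTERM, handle_sigterm)

# A real worker would loop: while not shutting_down: process_next_task()
# Here we simulate `docker stop` by sending SIGTERM to our own process:
os.kill(os.getpid(), signal.SIGTERM)
assert shutting_down  # the handler ran; stop accepting new work and finish up
```

The key design choice is that the handler only sets a flag — the actual cleanup happens in the main loop, where it's safe to finish the current task before exiting.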
Practice 4: Pin Your Dependencies (Both Ways)
Never use unpinned requirements: flask instead of flask==3.0.2. A dependency update between your last build and the next can silently break behavior. But also pin your base image:
# Don't do this — "slim" is a moving target
FROM python:3.11-slim
# Do this — pinned by digest (immutable)
FROM python:3.11.9-slim@sha256:abc123...
# Or at minimum pin the patch version
FROM python:3.11.9-slim
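Pinning is also easy to enforce mechanically in CI. A small sketch — the unpinned helper is illustrative, not from any standard tooling — that flags requirements lines lacking an exact == pin:

```python
import re

# Matches "name==1.2.3" style pins, optionally with extras like "name[redis]==1.2.3"
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==\S+$")

def unpinned(requirement_lines):
    """Return the requirement lines that are not pinned to an exact version."""
    bad = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if line and not PINNED.match(line):
            bad.append(line)
    return bad

print(unpinned(["flask==3.0.2", "requests", "celery>=5.3"]))
# -> ['requests', 'celery>=5.3']
```

Fail the build if the list is non-empty and unpinned dependencies never reach an image.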
Practice 5: Health Checks
Without a HEALTHCHECK directive, Docker and your orchestrator (ECS, Kubernetes) have no way to know if your container is actually serving traffic or just running. A container that's started but not ready is indistinguishable from a healthy one:
HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
Note that curl isn't included in slim base images — install it in the runtime stage alongside tini, or swap the check for a small Python one-liner.
And the corresponding Flask endpoint:
@app.route('/health')
def health():
    # Check real dependencies — DB connection, Redis connection
    try:
        db.execute('SELECT 1')
        redis_client.ping()
        return {'status': 'healthy'}, 200
    except Exception as e:
        return {'status': 'unhealthy', 'error': str(e)}, 503
Practice 6: Secrets — Never in the Image
The most critical security practice. Never put secrets (API keys, DB passwords, JWT secrets) in your Dockerfile, docker-compose.yml, or environment variables that get baked into the image. Image layers are immutable and inspectable — anyone with pull access to your registry can extract every layer and read your .env file.
# Wrong — secret visible in image layer
ENV DATABASE_URL=postgresql://user:password@host/db
# Right — inject at runtime via orchestrator secrets
# In ECS task definition:
{
  "secrets": [{
    "name": "DATABASE_URL",
    "valueFrom": "arn:aws:secretsmanager:region:account:secret:prod/db-url"
  }]
}
For local development, use a .env file with docker run --env-file .env; --env-file injects variables at container start without writing them into a layer. Make sure .env is in your .gitignore and never committed. In CI, inject secrets from your secrets manager (AWS Secrets Manager, GitHub Actions secrets, etc.).
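Whichever injection path you use, read secrets at runtime and fail fast at startup if one is missing — a crash at boot is far easier to diagnose than a half-configured service. A minimal sketch (require_env is an illustrative helper, not from any library):

```python
import os

def require_env(name):
    """Read a required secret from the environment, failing fast if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"required environment variable {name} is not set")
    return value

# At startup, before serving any traffic:
# DATABASE_URL = require_env("DATABASE_URL")
```

Calling this for every required secret at import time means a misconfigured task definition fails the health check immediately instead of erroring on the first real request.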
Practice 7: Resource Limits
A Flask service with a memory leak or a runaway Celery task can consume all host memory, starving other containers on the same host. Always set memory and CPU limits in production:
# In docker-compose for local dev
services:
  api:
    image: my-flask-service
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
        reservations:
          memory: 256M
Practice 8: Optimize Layer Caching
Docker caches layers. Invalidating the cache for a layer invalidates all subsequent layers. Since dependencies change far less often than application code, always copy requirements.txt and install dependencies before copying your application code:
# Bad — cache invalidated on every code change
COPY . .
RUN pip install -r requirements.txt
# Good — dependencies cached unless requirements.txt changes
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
This alone cut our CI/CD build times from 4 minutes to 45 seconds for iterative deploys.
Key Takeaways
- Multi-stage builds: cut image sizes 70% by separating build from runtime
- Always run as a non-root user — two lines, huge security impact
- Use tini as PID 1 for proper signal handling and graceful shutdown
- Pin both Python version and dependency versions — reproducibility over convenience
- HEALTHCHECK is required for orchestrators to manage container lifecycle correctly
- Secrets injected at runtime, never baked into image layers
- Copy requirements.txt first to maximize layer cache efficiency