
Deploying Flask Microservices to AWS ECS: A Production Setup Guide

28 June 2025 · 10 min read · Harshit Gupta

TL;DR

ECS Fargate is the fastest path to production containers on AWS without managing EC2 instances. Key components: ECR for image registry, ECS task definition (memory/CPU limits, secrets from Secrets Manager, env vars), ECS service (desired count, auto-scaling), ALB for traffic routing, and CloudWatch for logs. This guide walks through the complete setup.

Why ECS Over EC2 or Lambda

When we containerized CertifyMe's microservices, we had three deployment options: raw EC2 instances, Lambda functions, or ECS (Elastic Container Service). We chose ECS Fargate for the backend APIs and Celery workers for reasons that I think generalize well:

Long-running processes: Celery workers and Flask apps are persistent processes, so Lambda's 15-minute timeout is a non-starter.
Predictable resource usage: we could right-size CPU/memory per service based on actual profiling and not worry about cold starts.
Existing Docker containers: we were already containerized, so ECS was the most direct path.
No instance management: Fargate eliminates patching, scaling, and maintaining EC2 fleets.

The Architecture

Route 53 → CloudFront (optional)
    ↓
Application Load Balancer
    ↓
ECS Service (Flask API)
    │  ├── Task 1 (Fargate container: Flask + Gunicorn)
    │  ├── Task 2 (auto-scaled based on CPU/request count)
    │  └── Task N
    ↓
ECR (Elastic Container Registry) — private image storage
    ↓
AWS Secrets Manager — DB password, API keys, JWT secret
    ↓
RDS MySQL (private subnet)   Redis (ElastiCache)

Step 1: Push Your Image to ECR

# Authenticate Docker to ECR
aws ecr get-login-password --region ap-south-1 | \
  docker login --username AWS --password-stdin \
  123456789.dkr.ecr.ap-south-1.amazonaws.com

# Create the repository
aws ecr create-repository \
  --repository-name certifyme/credential-service \
  --image-scanning-configuration scanOnPush=true

# Build, tag, and push
docker build -t certifyme/credential-service .
docker tag certifyme/credential-service:latest \
  123456789.dkr.ecr.ap-south-1.amazonaws.com/certifyme/credential-service:latest
docker push 123456789.dkr.ecr.ap-south-1.amazonaws.com/certifyme/credential-service:latest

Enable image scanning

The scanOnPush=true flag enables ECR's built-in CVE scanning on every push. The scan takes roughly 30 seconds and produces a vulnerability report for your image layers. Add a CI gate that fails the build on HIGH or CRITICAL findings.
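One way to build such a gate is to parse the JSON printed by `aws ecr describe-image-scan-findings` and fail the build when blocking severities are present. The sketch below is a minimal version under a few assumptions: the response has already been loaded into a dict, the repository uses ECR basic scanning (where a finished scan reports the status `COMPLETE`), and the function name `blocking_findings` and the sample payload are illustrative rather than part of the original setup.

```python
import json

# Severities that should fail the build, per the guidance above.
BLOCKING = ("CRITICAL", "HIGH")


def blocking_findings(scan_response: dict) -> dict:
    """Return {severity: count} for blocking severities in an ECR scan response.

    Expects the JSON shape printed by `aws ecr describe-image-scan-findings`.
    Raises if the scan has not completed, so an unfinished scan never passes.
    """
    status = scan_response.get("imageScanStatus", {}).get("status")
    if status != "COMPLETE":
        raise RuntimeError(f"scan not complete (status={status!r})")
    counts = scan_response.get("imageScanFindings", {}).get("findingSeverityCounts", {})
    return {sev: counts[sev] for sev in BLOCKING if counts.get(sev, 0) > 0}


# Hypothetical sample payload, trimmed to the fields the gate reads.
sample = json.loads("""
{
  "imageScanStatus": {"status": "COMPLETE"},
  "imageScanFindings": {"findingSeverityCounts": {"HIGH": 2, "MEDIUM": 5}}
}
""")

found = blocking_findings(sample)
print(found)  # {'HIGH': 2}
```

In a CI job, a non-empty result would translate into a non-zero exit code, for example `sys.exit(1)` after printing the counts.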

Step 2: Create the Task Definition

The task definition is your container spec — it defines the image, resource limits, environment variables, secrets, and logging configuration:

{
  "family": "credential-service",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789:role/ecsTaskRole",
  "containerDefinitions": [{
    "name": "credential-service",
    "image": "123456789.dkr.ecr.ap-south-1.amazonaws.com/certifyme/credential-service:latest",
    "portMappings": [{"containerPort": 8000, "protocol": "tcp"}],
    "environment": [
      {"name": "FLASK_ENV", "value": "production"},
      {"name": "REDIS_HOST", "value": "your-elasticache-endpoint"}
    ],
    "secrets": [
      {
        "name": "DATABASE_URL",
        "valueFrom": "arn:aws:secretsmanager:ap-south-1:123456789:secret:prod/db-url"
      },
      {
        "name": "JWT_SECRET",
        "valueFrom": "arn:aws:secretsmanager:ap-south-1:123456789:secret:prod/jwt-secret"
      }
    ],
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"],
      "interval": 30,
      "timeout": 10,
      "retries": 3,
      "startPeriod": 15
    },
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/credential-service",
        "awslogs-region": "ap-south-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }]
}
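Inside the container, both the `environment` entries and the `secrets` entries appear as ordinary environment variables; ECS resolves the Secrets Manager ARNs when the task starts. Here is a minimal sketch of how the Flask app might read them, failing fast at startup if one is missing. The `Settings` class and `load_settings` function are hypothetical names, and the `localhost` fallback for Redis is an assumption for local runs.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str  # injected by ECS from the prod/db-url secret
    jwt_secret: str    # injected by ECS from the prod/jwt-secret secret
    redis_host: str    # plain environment variable


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, raising if a secret is missing."""
    missing = [name for name in ("DATABASE_URL", "JWT_SECRET") if not env.get(name)]
    if missing:
        # Stop at startup with a clear error instead of serving broken requests.
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        jwt_secret=env["JWT_SECRET"],
        redis_host=env.get("REDIS_HOST", "localhost"),
    )


# Example with a fake environment (placeholder values, not real credentials).
settings = load_settings({
    "DATABASE_URL": "mysql://user:pass@db.internal/app",
    "JWT_SECRET": "example-only",
    "REDIS_HOST": "your-elasticache-endpoint",
})
print(settings.redis_host)  # your-elasticache-endpoint
```

Passing the environment as a parameter keeps the loader easy to test without touching the real process environment.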

Step 3: Create the ECS Service

aws ecs create-service \
  --cluster production \
  --service-name credential-service \
  --task-definition credential-service:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-private-1a,subnet-private-1b],securityGroups=[sg-app-tier],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=credential-service,containerPort=8000" \
  --health-check-grace-period-seconds 60

Step 4: Auto-Scaling

# Register scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production/credential-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 10

# Scale on CPU utilization
aws application-autoscaling put-scaling-policy \
  --policy-name cpu-scaling \
  --service-namespace ecs \
  --resource-id service/production/credential-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'

Set ScaleInCooldown much longer than ScaleOutCooldown

Scale out aggressively (60s cooldown) to handle traffic spikes fast. Scale in conservatively (300s cooldown) to avoid thrashing — scaling in and then immediately back out during traffic fluctuations wastes resources and causes brief capacity shortfalls.
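To make the thrashing argument concrete, here is a toy simulation of asymmetric cooldowns. It is a deliberately simplified model, not the algorithm Application Auto Scaling actually runs: it makes one decision per minute, sizes the service as `ceil(load / target)`, and treats each cooldown as the minimum gap between two actions of the same direction. The function names and the traffic pattern are assumptions for illustration.

```python
import math


def simulate(loads, tasks=2, target=70.0, min_tasks=2, max_tasks=10,
             out_cooldown=60, in_cooldown=300, step=60):
    """Toy model: one decision per `step` seconds; returns the task count after each step.

    `loads` is total CPU demand in "percent of one task" units, so a load of 280
    keeps 4 tasks at 70% each.
    """
    last_out = last_in = -math.inf
    history = []
    for i, load in enumerate(loads):
        now = i * step
        desired = max(min_tasks, min(max_tasks, math.ceil(load / target)))
        if desired > tasks and now - last_out >= out_cooldown:
            tasks, last_out = desired, now   # scale out
        elif desired < tasks and now - last_in >= in_cooldown:
            tasks, last_in = desired, now    # scale in
        history.append(tasks)
    return history


def scaling_actions(history, start):
    """Count how many times the task count changed."""
    changes, prev = 0, start
    for t in history:
        changes += t != prev
        prev = t
    return changes


# Traffic that flips between a busy minute and a quiet minute.
loads = [280, 140] * 5

slow_in = simulate(loads, tasks=4, in_cooldown=300)
fast_in = simulate(loads, tasks=4, in_cooldown=60)
print(scaling_actions(slow_in, 4), scaling_actions(fast_in, 4))  # 4 9
```

With the 300-second scale-in cooldown the service changes size 4 times over ten minutes; with a 60-second cooldown it changes 9 times, dropping to 2 tasks just before each busy minute.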

CI/CD Integration

# GitHub Actions — deploy on push to main
- name: Deploy to ECS
  run: |
    # Update the task definition with the new image tag
    NEW_IMAGE="${{ env.ECR_REGISTRY }}/${{ env.SERVICE_NAME }}:${{ github.sha }}"

    TASK_DEF=$(aws ecs describe-task-definition --task-definition credential-service --query taskDefinition --output json)
    # Quote "$TASK_DEF" so the shell does not word-split or glob the JSON
    NEW_TASK_DEF=$(echo "$TASK_DEF" | jq --arg IMAGE "$NEW_IMAGE" \
      '.containerDefinitions[0].image = $IMAGE | del(.taskDefinitionArn,.revision,.status,.requiresAttributes,.placementConstraints,.compatibilities,.registeredAt,.registeredBy)')

    NEW_TASK_ARN=$(aws ecs register-task-definition \
      --cli-input-json "$NEW_TASK_DEF" \
      --query taskDefinition.taskDefinitionArn --output text)

    aws ecs update-service \
      --cluster production \
      --service credential-service \
      --task-definition "$NEW_TASK_ARN"

    aws ecs wait services-stable \
      --cluster production \
      --services credential-service
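If the jq one-liner is hard to read, the same transformation can be sketched in Python: copy the current task definition, swap the image, and drop the fields that the jq filter deletes before re-registering. This is an illustrative equivalent, not the pipeline's actual code; `with_new_image` and the sample dict are hypothetical.

```python
import copy

# Same fields the jq filter above deletes before re-registering.
STRIPPED_FIELDS = (
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "placementConstraints", "compatibilities", "registeredAt", "registeredBy",
)


def with_new_image(task_def: dict, image: str, container_index: int = 0) -> dict:
    """Return a copy of task_def with a new image and the stripped fields removed."""
    new_def = copy.deepcopy(task_def)  # leave the caller's dict untouched
    new_def["containerDefinitions"][container_index]["image"] = image
    for field in STRIPPED_FIELDS:
        new_def.pop(field, None)
    return new_def


# Trimmed, hypothetical describe-task-definition output.
current = {
    "family": "credential-service",
    "revision": 1,
    "status": "ACTIVE",
    "taskDefinitionArn": "arn:aws:ecs:ap-south-1:123456789:task-definition/credential-service:1",
    "containerDefinitions": [{"name": "credential-service", "image": "repo:latest"}],
}

updated = with_new_image(current, "repo:abc1234")
print(sorted(updated))                              # ['containerDefinitions', 'family']
print(updated["containerDefinitions"][0]["image"])  # repo:abc1234
print(current["containerDefinitions"][0]["image"])  # repo:latest
```

The resulting dict could be serialized with `json.dumps` and passed to `aws ecs register-task-definition --cli-input-json`, as the workflow step does with the jq output.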

Key Takeaways

ECS Fargate: no instance management, right-size CPU/memory per service, ideal for persistent workloads
Always reference secrets from Secrets Manager so ECS injects them when the task starts; never hardcode them in the task definition's environment block
Scale out fast (60s cooldown), scale in slow (300s cooldown) to avoid capacity thrashing
Enable ECR image scanning on push with CI gates for HIGH/CRITICAL CVEs
aws ecs wait services-stable in CI ensures rollouts complete before marking a deploy successful
Private subnets for the app tier plus a NAT gateway for outbound traffic is the standard VPC pattern

