CloudWarrior

Cloud Readiness Checklist
for Engineering Teams

86 actionable checks across 10 domains. Use this to find what's slowing your team down, raising your cloud bill, or quietly increasing production risk — before it becomes an incident.

86 total checks · 10 domains · ~20 min to complete · Free, no signup needed

Cost Control & FinOps

Visibility and control over cloud spend before it becomes a fire drill.

8 checks
  • Top 5 cost drivers are visible in a shared dashboard reviewed monthly
  • Resource tagging policy enforced — every resource has owner, team, and env tags
  • Unused/idle resources (snapshots, unattached volumes, stopped VMs) are reclaimed automatically
  • Budget alerts are configured with business-context owners (not just the infra team)
  • Reserved Instances / Savings Plans cover ≥ 60% of baseline compute
  • Spot/Preemptible instances used for non-critical batch and CI workloads
  • Cost anomaly detection is active, and anomalies are triaged within 24h
  • Monthly unit-economics review: cost per user / per transaction tracked
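The tagging check above is easy to automate. A minimal sketch of a tag-policy audit, assuming resources are represented as simple tag dictionaries (the resource IDs and tag values here are illustrative, not from any real account):

```python
# Tag-policy check: every resource must carry owner, team, and env tags,
# as required by the checklist item above.

REQUIRED_TAGS = {"owner", "team", "env"}

def missing_tags(resource_tags):
    """Return the set of required tags absent from a resource."""
    return REQUIRED_TAGS - set(resource_tags)

def audit(resources):
    """Map resource ID -> missing tags, for resources failing the policy."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report

resources = {
    "i-0abc":   {"owner": "alice", "team": "payments", "env": "prod"},
    "vol-9xyz": {"team": "payments"},  # missing owner and env
}
print(audit(resources))  # only vol-9xyz is flagged
```

In practice you would feed this from your cloud provider's inventory API and wire the report into the monthly cost review.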

Security & IAM

No shared credentials, no wildcard permissions, no unaudited access.

10 checks
  • No long-lived access keys in code, CI env vars, or developer machines
  • Least-privilege IAM roles — no wildcard (*) permissions in production
  • MFA enforced on all cloud console accounts and privileged roles
  • Secrets managed via Vault, AWS Secrets Manager, or equivalent — not env files
  • Network perimeter: public internet access limited to load balancers only
  • Security group / firewall rules reviewed quarterly and unused rules removed
  • All storage buckets / blobs are private by default; public access is explicit and logged
  • Vulnerability scanning runs on every container image before deploy
  • SAST/DAST integrated in CI pipeline — not a quarterly manual scan
  • Incident response runbook exists and was tested in the last 6 months
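The wildcard-permission check can also run in CI. A hedged sketch that scans an IAM-style policy document (parsed JSON, mirroring AWS's Statement/Action/Resource layout) for bare `*` grants; the example policy is illustrative:

```python
# Flag IAM statements that grant a bare wildcard ("*") action or resource,
# per the "no wildcard (*) permissions in production" check above.

def wildcard_statements(policy):
    """Return statements whose Action or Resource contains a bare '*'."""
    flagged = []
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        if isinstance(actions, str):
            actions = [actions]
        if isinstance(resources, str):
            resources = [resources]
        if "*" in actions or "*" in resources:
            flagged.append(stmt)
    return flagged

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::app-assets/*"},   # scoped: passes
        {"Effect": "Allow", "Action": "*", "Resource": "*"},  # fails review
    ],
}
print(len(wildcard_statements(policy)))  # 1
```

Note the check only flags a bare `*`, not scoped prefixes like `arn:aws:s3:::app-assets/*`; tighten as your policy requires.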

CI/CD & Release Safety

Deploys are boring. Rollbacks are fast. No single person is a deploy bottleneck.

10 checks
  • Every merge to main triggers automated build + test pipeline
  • Unit test coverage ≥ 70% for core business logic
  • Integration tests run against a staging environment before production deploy
  • No manual steps in the production deploy process
  • Rollback is documented, tested, and executes in under 10 minutes
  • Deploy ownership is shared across the team — no single-person bottleneck
  • Feature flags used for risky or incremental rollouts
  • Blue/green or canary deployment strategy in place for critical services
  • Database migrations are backwards-compatible and reversible
  • Dependency updates are automated (Dependabot / Renovate) with auto-merge for patches
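Several of the checks above (automated pipeline on merge, staging before production, no manual steps) can be captured in one pipeline definition. A sketch using GitHub Actions, where the job names and `make` targets are illustrative assumptions, not a prescribed setup:

```yaml
# Sketch: every merge to main builds, tests, deploys to staging, then
# promotes to production behind an environment approval gate.
name: release
on:
  push:
    branches: [main]
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build
      - run: make test               # unit tests + coverage gate
  deploy-staging:
    needs: build-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make deploy ENV=staging
      - run: make integration-test ENV=staging
  deploy-prod:
    needs: deploy-staging
    environment: production          # approval gate, no manual shell steps
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make deploy ENV=prod
```

The same shape translates directly to GitLab CI, CircleCI, or Jenkins pipelines.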

Observability & Monitoring

You know your system is healthy before your customers tell you it's not.

10 checks
  • Structured logging in place — logs are queryable, not just text blobs
  • Distributed tracing active for all inter-service calls
  • Golden Signals monitored: latency, traffic, error rate, saturation
  • Business-impact dashboards exist (not just infra metrics)
  • Alerting routes to the right owner, not just #ops channel
  • Alert fatigue reviewed: false positive rate under 20%
  • On-call rotation documented, rotated, and not bottlenecked on one person
  • Error budgets defined and reviewed monthly
  • Runbooks linked directly from alert notifications
  • SLI/SLO defined and measured for every customer-facing service
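"Structured logging" concretely means one queryable JSON object per event. A minimal sketch using Python's standard `logging` module; the logger name and field names are illustrative:

```python
# Emit one JSON object per log event so logs are queryable by field,
# not just text blobs, per the first check above.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **getattr(record, "fields", {}),  # structured context, if any
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# "extra" attaches a fields dict to the record for the formatter to merge in
log.info("payment captured",
         extra={"fields": {"order_id": "ord_42", "latency_ms": 87}})
```

With logs shaped like this, "error rate for order_id X in the last hour" becomes a query instead of a grep.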

Infrastructure as Code

No snowflake servers. No click-ops. Infra changes go through pull requests.

8 checks
  • 100% of production infrastructure defined in IaC (Terraform, Pulumi, CDK)
  • IaC code is version-controlled and reviewed via PRs
  • No manual 'click-ops' changes in production console
  • Environments (dev/staging/prod) are created from the same IaC modules
  • State files stored remotely with locking (S3+DynamoDB, Terraform Cloud, etc.)
  • Drift detection runs on a schedule — alerts when reality diverges from code
  • IaC modules are versioned and shared across teams
  • Secrets never hardcoded in IaC — all referenced from secrets manager
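The remote-state-with-locking check maps to a few lines of Terraform. A sketch of the S3 + DynamoDB variant named above; bucket, key, and table names are illustrative:

```hcl
# Remote state with locking: S3 stores the state file, DynamoDB holds the
# lock so two applies can't corrupt state concurrently.
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"   # hypothetical bucket
    key            = "prod/network.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"        # lock table, hash key "LockID"
    encrypt        = true
  }
}
```

Terraform Cloud and most other managed backends give you the same locking guarantee without the DynamoDB table.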

Reliability & SLA

Single points of failure are identified and mitigated before they bite you.

8 checks
  • SLA defined for every external-facing service (uptime, latency, error rate)
  • No single points of failure in the critical path
  • Multi-AZ or multi-region redundancy for stateful services
  • Auto-scaling configured and tested for 3x normal peak traffic
  • Chaos engineering or fault injection runs quarterly
  • Load testing performed before every major launch
  • Circuit breakers or bulkheads implemented between service boundaries
  • Health check endpoints return meaningful status (not just HTTP 200)
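The circuit-breaker check deserves a concrete shape. A hedged, minimal sketch (thresholds and naming are illustrative; production code would typically use a library rather than hand-rolling this):

```python
# Circuit breaker between service boundaries: after max_failures
# consecutive errors the breaker opens and calls fail fast until
# reset_after seconds pass, protecting callers from a dying dependency.
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("breaker open, failing fast")
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # success closes the breaker
        return result
```

Wrap outbound calls to a downstream service in `breaker.call(...)` so a dead dependency costs you milliseconds, not a thread pool.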

Kubernetes / Container Maturity

Running containers in prod ≠ running them well. These are the gaps most teams miss.

10 checks
  • All workloads have resource requests and limits defined
  • Pod Disruption Budgets configured for critical deployments
  • Liveness and readiness probes correctly configured (not just hitting the root / endpoint)
  • Images are non-root and use minimal base (distroless/alpine)
  • Network Policies restrict pod-to-pod communication by default
  • Cluster auto-scaler or Karpenter handles node scaling automatically
  • Secrets stored in external secret store — not native k8s Secrets in etcd plaintext
  • Namespace-level RBAC separates team access to cluster resources
  • GitOps (ArgoCD / Flux) manages cluster state — no kubectl apply in prod from laptops
  • Regular k8s version upgrades — cluster not more than 2 minor versions behind
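Three of the checks above (requests/limits, correct probes, non-root, plus a PDB) fit in one manifest. A sketch where the app name, image, ports, and paths are illustrative:

```yaml
# Deployment fragment covering resource requests/limits, distinct
# readiness vs liveness probes, and non-root execution.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 3
  selector:
    matchLabels: {app: checkout}
  template:
    metadata:
      labels: {app: checkout}
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout:1.4.2   # hypothetical image
          securityContext:
            runAsNonRoot: true
          resources:
            requests: {cpu: 250m, memory: 256Mi}
            limits:   {cpu: "1",  memory: 512Mi}
          readinessProbe:                      # gates traffic; checks deps
            httpGet: {path: /ready, port: 8080}
          livenessProbe:                       # restart only on deadlock
            httpGet: {path: /healthz, port: 8080}
            initialDelaySeconds: 10
---
# Pod Disruption Budget: node drains and upgrades can never take the
# service below two ready replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout
spec:
  minAvailable: 2
  selector:
    matchLabels: {app: checkout}
```

The key detail teams miss: readiness and liveness should check different things, or a slow dependency triggers a restart storm.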

Database & Data Management

Data loss is the incident you don't recover from. Most teams find this out too late.

9 checks
  • Automated backups run daily with retention ≥ 30 days
  • Backup restore tested in the last 90 days (not just 'we have backups')
  • Read replicas used for analytics and reporting queries
  • Connection pooling in place — app does not open unbounded DB connections
  • Slow query monitoring active with alerts
  • Point-in-time recovery (PITR) enabled for critical databases
  • Database credentials rotated on schedule via secrets manager
  • No direct production DB access from developer machines
  • Schema migrations reviewed for backward compatibility before deploy
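The backup checks above are measurable, not aspirational. A sketch of a freshness/retention report, assuming you can list backup timestamps and know the date of the last restore test (all timestamps here are illustrative):

```python
# Evaluate backups against the checklist targets: a backup within the
# last day, retention of at least 30 days, restore tested within 90 days.
from datetime import datetime, timedelta

def backup_report(backup_times, last_restore_test, now):
    """Return pass/fail for the three backup checks above."""
    newest = max(backup_times)
    oldest = min(backup_times)
    return {
        "backup_fresh":   now - newest <= timedelta(days=1),
        "retention_ok":   now - oldest >= timedelta(days=30),
        "restore_tested": now - last_restore_test <= timedelta(days=90),
    }

now = datetime(2024, 6, 1)
backups = [now - timedelta(days=d) for d in range(32)]  # daily, 32 days back
report = backup_report(backups, last_restore_test=now - timedelta(days=40), now=now)
print(report)  # all three checks pass for this example
```

Run a report like this on a schedule and alert on any `False`; "we have backups" then becomes a verified claim.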

Developer Experience & Velocity

Slow internal tools kill product velocity. These are the quick wins most teams skip.

7 checks
  • Local dev environment is documented and reproducible in under 30 minutes
  • CI pipeline completes in under 15 minutes (fast feedback loop)
  • Developers can deploy to staging without infra team involvement
  • Onboarding documentation updated and tested with the last new hire
  • Dependency licenses scanned and policy enforced
  • Internal service catalog exists (what runs where, who owns it)
  • Post-mortems written and shared after every P0/P1 incident

Disaster Recovery & Business Continuity

Recovery Time Objective and Recovery Point Objective are defined — not guessed.

6 checks
  • RTO and RPO defined and signed off by the business for each critical service
  • DR runbook exists and was executed in the last 6 months
  • Cross-region or cross-cloud failover tested at least annually
  • Critical configuration and secrets backed up separately from application data
  • Communication plan for customer-facing outages is documented and approved
  • Post-recovery validation checklist confirms system integrity before traffic switch
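RTO and RPO stop being guesses once you measure them during a drill. A worked example with illustrative timestamps: RPO is the data window at risk (failure time minus last recoverable backup), RTO is the elapsed downtime:

```python
# Measure achieved RPO/RTO from a DR drill's timeline.
from datetime import datetime

last_backup = datetime(2024, 6, 1, 3, 0)    # nightly snapshot completed
failure     = datetime(2024, 6, 1, 9, 30)   # region outage begins
restored    = datetime(2024, 6, 1, 10, 45)  # failover traffic switch done

rpo = failure - last_backup   # data at risk
rto = restored - failure      # downtime
print(rpo, rto)  # 6:30:00 1:15:00
```

Compare these measured numbers against the targets the business signed off on; a 6.5-hour measured RPO against a 1-hour target is a finding, not a footnote.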

How to score your results

  • 80–100%: Cloud-mature. Focus on optimisation and cost efficiency.
  • 50–79%: Growing pains. Prioritise security and reliability gaps before scaling.
  • Below 50%: High risk. Address cost control, IAM, and CI/CD as immediate priorities.

Found gaps? Let's fix them.

CloudWarrior offers a free 30-minute infrastructure audit call. We'll walk through your highest-risk areas and give you a prioritised action plan — no pitch, just substance.

Book a Free Infrastructure Audit
Browse All Resources

Used by engineering teams at Series A–C startups across Europe. Typical engagement: 1–3 weeks.

Get Your Free Infrastructure Audit