CloudWarrior

Cloud Readiness Checklist
for Engineering Teams

86 actionable checks across 10 domains. Use this to find what's slowing your team down, raising your cloud bill, or quietly increasing production risk — before it becomes an incident.

86 total checks · 10 domains · ~20 min to complete · Free, no signup needed

Cost Control & FinOps

Visibility and control over cloud spend before it becomes a fire drill.

8 checks
  • Top 5 cost drivers are visible in a shared dashboard reviewed monthly
  • Resource tagging policy enforced — every resource has owner, team, and env tags
  • Unused/idle resources (snapshots, unattached volumes, stopped VMs) are reclaimed automatically
  • Budget alerts are configured with business-context owners (not just the infra team)
  • Reserved Instances / Savings Plans cover ≥ 60% of baseline compute
  • Spot/Preemptible instances used for non-critical batch and CI workloads
  • Cost anomaly detection is active, and anomalies are triaged within 24h
  • Monthly unit-economics review: cost per user / per transaction tracked
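The tagging check above is easy to automate. A minimal sketch of a tag-policy audit, assuming resources are represented as simple tag dictionaries (the resource IDs and tag values here are illustrative, not from any real account):

```python
# Tag-policy check: every resource must carry owner, team, and env tags,
# as required by the checklist item above.

REQUIRED_TAGS = {"owner", "team", "env"}

def missing_tags(resource_tags):
    """Return the set of required tags absent from a resource."""
    return REQUIRED_TAGS - set(resource_tags)

def audit(resources):
    """Map resource ID -> missing tags, for resources failing the policy."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report

resources = {
    "i-0abc":   {"owner": "alice", "team": "payments", "env": "prod"},
    "vol-9xyz": {"team": "payments"},  # missing owner and env
}
print(audit(resources))  # only vol-9xyz is flagged
```

In practice you would feed this from your cloud provider's inventory API and wire the report into the monthly cost review.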

Security & IAM

No shared credentials, no wildcard permissions, no unaudited access.

10 checks
  • No long-lived access keys in code, CI env vars, or developer machines
  • Least-privilege IAM roles — no wildcard (*) permissions in production
  • MFA enforced on all cloud console accounts and privileged roles
  • Secrets managed via Vault, AWS Secrets Manager, or equivalent — not env files
  • Network perimeter: public internet access limited to load balancers only
  • Security group / firewall rules reviewed quarterly and unused rules removed
  • All storage buckets / blobs are private by default; public access is explicit and logged
  • Vulnerability scanning runs on every container image before deploy
  • SAST/DAST integrated in CI pipeline — not a quarterly manual scan
  • Incident response runbook exists and was tested in the last 6 months
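The wildcard-permission check can also run in CI. A hedged sketch that scans an IAM-style policy document (parsed JSON, mirroring AWS's Statement/Action/Resource layout) for bare `*` grants; the example policy is illustrative:

```python
# Flag IAM statements that grant a bare wildcard ("*") action or resource,
# per the "no wildcard (*) permissions in production" check above.

def wildcard_statements(policy):
    """Return statements whose Action or Resource contains a bare '*'."""
    flagged = []
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        if isinstance(actions, str):
            actions = [actions]
        if isinstance(resources, str):
            resources = [resources]
        if "*" in actions or "*" in resources:
            flagged.append(stmt)
    return flagged

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::app-assets/*"},   # scoped: passes
        {"Effect": "Allow", "Action": "*", "Resource": "*"},  # fails review
    ],
}
print(len(wildcard_statements(policy)))  # 1
```

Note the check only flags a bare `*`, not scoped prefixes like `arn:aws:s3:::app-assets/*`; tighten as your policy requires.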

CI/CD & Release Safety

Deploys are boring. Rollbacks are fast. No single person is a deploy bottleneck.

10 checks
  • Every merge to main triggers automated build + test pipeline
  • Unit test coverage ≥ 70% for core business logic
  • Integration tests run against a staging environment before production deploy
  • No manual steps in the production deploy process
  • Rollback is documented, tested, and executes in under 10 minutes
  • Deploy ownership is shared across the team — no single-person bottleneck
  • Feature flags used for risky or incremental rollouts
  • Blue/green or canary deployment strategy in place for critical services
  • Database migrations are backwards-compatible and reversible
  • Dependency updates are automated (Dependabot / Renovate) with auto-merge for patches
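Several of the checks above (automated pipeline on merge, staging before production, no manual steps) can be captured in one pipeline definition. A sketch using GitHub Actions, where the job names and `make` targets are illustrative assumptions, not a prescribed setup:

```yaml
# Sketch: every merge to main builds, tests, deploys to staging, then
# promotes to production behind an environment approval gate.
name: release
on:
  push:
    branches: [main]
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build
      - run: make test               # unit tests + coverage gate
  deploy-staging:
    needs: build-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make deploy ENV=staging
      - run: make integration-test ENV=staging
  deploy-prod:
    needs: deploy-staging
    environment: production          # approval gate, no manual shell steps
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make deploy ENV=prod
```

The same shape translates directly to GitLab CI, CircleCI, or Jenkins pipelines.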

Observability & Monitoring

You know your system is healthy before your customers tell you it's not.

10 checks
  • Structured logging in place — logs are queryable, not just text blobs
  • Distributed tracing active for all inter-service calls
  • Golden Signals monitored: latency, traffic, error rate, saturation
  • Business-impact dashboards exist (not just infra metrics)
  • Alerting routes to the right owner, not just #ops channel
  • Alert fatigue reviewed: false positive rate under 20%
  • On-call rotation documented, rotated, and not bottlenecked on one person
  • Error budgets defined and reviewed monthly
  • Runbooks linked directly from alert notifications
  • SLI/SLO defined and measured for every customer-facing service
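"Structured logging" concretely means one queryable JSON object per event. A minimal sketch using Python's standard `logging` module; the logger name and field names are illustrative:

```python
# Emit one JSON object per log event so logs are queryable by field,
# not just text blobs, per the first check above.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **getattr(record, "fields", {}),  # structured context, if any
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# "extra" attaches a fields dict to the record for the formatter to merge in
log.info("payment captured",
         extra={"fields": {"order_id": "ord_42", "latency_ms": 87}})
```

With logs shaped like this, "error rate for order_id X in the last hour" becomes a query instead of a grep.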

Infrastructure as Code

No snowflake servers. No click-ops. Infra changes go through pull requests.

8 checks
  • 100% of production infrastructure defined in IaC (Terraform, Pulumi, CDK)
  • IaC code is version-controlled and reviewed via PRs
  • No manual 'click-ops' changes in production console
  • Environments (dev/staging/prod) are created from the same IaC modules
  • State files stored remotely with locking (S3+DynamoDB, Terraform Cloud, etc.)
  • Drift detection runs on a schedule — alerts when reality diverges from code
  • IaC modules are versioned and shared across teams
  • Secrets never hardcoded in IaC — all referenced from secrets manager
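The remote-state-with-locking check maps to a few lines of Terraform. A sketch of the S3 + DynamoDB variant named above; bucket, key, and table names are illustrative:

```hcl
# Remote state with locking: S3 stores the state file, DynamoDB holds the
# lock so two applies can't corrupt state concurrently.
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"   # hypothetical bucket
    key            = "prod/network.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"        # lock table, hash key "LockID"
    encrypt        = true
  }
}
```

Terraform Cloud and most other managed backends give you the same locking guarantee without the DynamoDB table.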

Reliability & SLA

Single points of failure are identified and mitigated before they bite you.

8 checks
  • SLA defined for every external-facing service (uptime, latency, error rate)
  • No single points of failure in the critical path
  • Multi-AZ or multi-region redundancy for stateful services
  • Auto-scaling configured and tested for 3x normal peak traffic
  • Chaos engineering or fault injection runs quarterly
  • Load testing performed before every major launch
  • Circuit breakers or bulkheads implemented between service boundaries
  • Health check endpoints return meaningful status (not just HTTP 200)
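The circuit-breaker check deserves a concrete shape. A hedged, minimal sketch (thresholds and naming are illustrative; production code would typically use a library rather than hand-rolling this):

```python
# Circuit breaker between service boundaries: after max_failures
# consecutive errors the breaker opens and calls fail fast until
# reset_after seconds pass, protecting callers from a dying dependency.
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("breaker open, failing fast")
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # success closes the breaker
        return result
```

Wrap outbound calls to a downstream service in `breaker.call(...)` so a dead dependency costs you milliseconds, not a thread pool.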

Kubernetes / Container Maturity

Running containers in prod ≠ running them well. These are the gaps most teams miss.

10 checks
  • All workloads have resource requests and limits defined
  • Pod Disruption Budgets configured for critical deployments
  • Liveness and readiness probes correctly configured (not just hitting the root / endpoint)
  • Images are non-root and use minimal base (distroless/alpine)
  • Network Policies restrict pod-to-pod communication by default
  • Cluster auto-scaler or Karpenter handles node scaling automatically
  • Secrets stored in external secret store — not native k8s Secrets in etcd plaintext
  • Namespace-level RBAC separates team access to cluster resources
  • GitOps (ArgoCD / Flux) manages cluster state — no kubectl apply in prod from laptops
  • Regular k8s version upgrades — cluster not more than 2 minor versions behind
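Three of the checks above (requests/limits, correct probes, non-root, plus a PDB) fit in one manifest. A sketch where the app name, image, ports, and paths are illustrative:

```yaml
# Deployment fragment covering resource requests/limits, distinct
# readiness vs liveness probes, and non-root execution.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 3
  selector:
    matchLabels: {app: checkout}
  template:
    metadata:
      labels: {app: checkout}
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout:1.4.2   # hypothetical image
          securityContext:
            runAsNonRoot: true
          resources:
            requests: {cpu: 250m, memory: 256Mi}
            limits:   {cpu: "1",  memory: 512Mi}
          readinessProbe:                      # gates traffic; checks deps
            httpGet: {path: /ready, port: 8080}
          livenessProbe:                       # restart only on deadlock
            httpGet: {path: /healthz, port: 8080}
            initialDelaySeconds: 10
---
# Pod Disruption Budget: node drains and upgrades can never take the
# service below two ready replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout
spec:
  minAvailable: 2
  selector:
    matchLabels: {app: checkout}
```

The key detail teams miss: readiness and liveness should check different things, or a slow dependency triggers a restart storm.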

Database & Data Management

Data loss is the incident you don't recover from. Most teams find this out too late.

9 checks
  • Automated backups run daily with retention ≥ 30 days
  • Backup restore tested in the last 90 days (not just 'we have backups')
  • Read replicas used for analytics and reporting queries
  • Connection pooling in place — app does not open unbounded DB connections
  • Slow query monitoring active with alerts
  • Point-in-time recovery (PITR) enabled for critical databases
  • Database credentials rotated on schedule via secrets manager
  • No direct production DB access from developer machines
  • Schema migrations reviewed for backward compatibility before deploy
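The backup checks above are measurable, not aspirational. A sketch of a freshness/retention report, assuming you can list backup timestamps and know the date of the last restore test (all timestamps here are illustrative):

```python
# Evaluate backups against the checklist targets: a backup within the
# last day, retention of at least 30 days, restore tested within 90 days.
from datetime import datetime, timedelta

def backup_report(backup_times, last_restore_test, now):
    """Return pass/fail for the three backup checks above."""
    newest = max(backup_times)
    oldest = min(backup_times)
    return {
        "backup_fresh":   now - newest <= timedelta(days=1),
        "retention_ok":   now - oldest >= timedelta(days=30),
        "restore_tested": now - last_restore_test <= timedelta(days=90),
    }

now = datetime(2024, 6, 1)
backups = [now - timedelta(days=d) for d in range(32)]  # daily, 32 days back
report = backup_report(backups, last_restore_test=now - timedelta(days=40), now=now)
print(report)  # all three checks pass for this example
```

Run a report like this on a schedule and alert on any `False`; "we have backups" then becomes a verified claim.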

Developer Experience & Velocity

Slow internal tools kill product velocity. These are the quick wins most teams skip.

7 checks
  • Local dev environment is documented and reproducible in under 30 minutes
  • CI pipeline completes in under 15 minutes (fast feedback loop)
  • Developers can deploy to staging without infra team involvement
  • Onboarding documentation updated and tested with the last new hire
  • Dependency licenses scanned and policy enforced
  • Internal service catalog exists (what runs where, who owns it)
  • Post-mortems written and shared after every P0/P1 incident

Disaster Recovery & Business Continuity

Recovery Time Objective and Recovery Point Objective are defined — not guessed.

6 checks
  • RTO and RPO defined and signed off by the business for each critical service
  • DR runbook exists and was executed in the last 6 months
  • Cross-region or cross-cloud failover tested at least annually
  • Critical configuration and secrets backed up separately from application data
  • Communication plan for customer-facing outages is documented and approved
  • Post-recovery validation checklist confirms system integrity before traffic switch
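RTO and RPO stop being guesses once you measure them during a drill. A worked example with illustrative timestamps: RPO is the data window at risk (failure time minus last recoverable backup), RTO is the elapsed downtime:

```python
# Measure achieved RPO/RTO from a DR drill's timeline.
from datetime import datetime

last_backup = datetime(2024, 6, 1, 3, 0)    # nightly snapshot completed
failure     = datetime(2024, 6, 1, 9, 30)   # region outage begins
restored    = datetime(2024, 6, 1, 10, 45)  # failover traffic switch done

rpo = failure - last_backup   # data at risk
rto = restored - failure      # downtime
print(rpo, rto)  # 6:30:00 1:15:00
```

Compare these measured numbers against the targets the business signed off on; a 6.5-hour measured RPO against a 1-hour target is a finding, not a footnote.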

How to score your results

  • 80–100%: Cloud-mature. Focus on optimisation and cost efficiency.
  • 50–79%: Growing pains. Prioritise security and reliability gaps before scaling.
  • Below 50%: High risk. Address cost control, IAM, and CI/CD as immediate priorities.

Found gaps? Let's fix them.

CloudWarrior offers a free 30-minute infrastructure audit call. We'll walk through your highest-risk areas and give you a prioritised action plan — no pitch, just substance.

Book a Free Infrastructure Audit
Browse All Resources

Used by engineering teams at Series A–C startups across Europe. Typical engagement: 1–3 weeks.

Get Your Free Infrastructure Audit