The Big Picture Direction: Cloud Platforms That Hold Up in Production
This document sets a big-picture direction for becoming the engineer described by this value proposition:
I build secure delivery systems, reliable infrastructure, and operational guardrails for teams that need more than YAML, dashboards, and green pipelines. The goal is simple: ship confidently, recover quickly, and own failure before users feel it.
The goal is not to become a tool collector. The goal is to become the person teams trust when production matters.
North Star#
Become the engineer who can answer these questions clearly:
- Can we deploy this safely?
- If it fails, how will we know?
- If users are affected, what signal fires first?
- Can we roll back?
- Can we recover data?
- Who owns this service?
- What is the blast radius?
- What guardrail prevents this mistake next time?
- What should be automated?
- What should still require human review?
That is how you become the person teams trust when the platform has to hold up in production.
1. The Mindset#
1.1 Think in outcomes, not tools#
Tools are only useful when they improve an operational outcome.
Do not say:
- "I know Kubernetes."
- "I know Terraform."
- "I know GitHub Actions."
Say:
- "I can make deployments safer."
- "I can make infrastructure reviewable and recoverable."
- "I can reduce mean time to recovery."
- "I can design guardrails that stop common failure modes."
- "I can help teams understand whether production is healthy."
1.2 Own failure end-to-end#
Production reliability is not only about preventing failure. It is also about preparing for failure.
You should constantly ask:
- What happens if this dependency slows down?
- What happens if this deployment partially succeeds?
- What happens if this Terraform change replaces a shared resource?
- What happens if a secret expires?
- What happens if a region or availability zone fails?
- What signal tells us users are suffering before support tickets arrive?
1.3 Automate, but keep judgment in the loop#
AI and automation can write YAML, Terraform, scripts, and runbooks faster than humans.
Your value is knowing:
- what should be automated
- what should require review
- what blast radius is acceptable
- what signals prove the change worked
- what rollback path exists
- what security boundary must not be crossed
1.4 Prefer boring systems#
The best platform work often feels boring:
- predictable deployments
- boring rollbacks
- boring alerts
- boring infrastructure reviews
- boring incident response
- boring onboarding
Boring means the system is understandable under pressure.
2. Core Capability Map#
2.1 Cloud foundations#
You need enough cloud architecture depth to build environments that teams can safely use.
Focus areas:
- AWS account structure and organization strategy
- IAM roles, permission boundaries, and least privilege
- VPC design, subnets, routing, NAT, security groups, NACLs
- Load balancing and DNS
- Secrets management
- Compute choices: EC2, ECS, EKS, Lambda
- Storage choices: S3, EBS, EFS
- Database operations: RDS, backups, failover, read replicas
- Multi-environment design: dev, staging, production
Tools and services to know:
- AWS Organizations
- IAM
- VPC
- Route 53
- ACM
- ALB / NLB
- EKS
- ECS
- RDS
- S3
- Secrets Manager
- Systems Manager Parameter Store
- CloudWatch
- CloudTrail
- AWS Config
- AWS Backup
What "good" looks like:
- Environments are consistent.
- IAM is scoped and reviewable.
- Networking is understandable.
- Production has backups, monitoring, and recovery paths.
- Changes are made through code, not console-clicking.
2.2 Infrastructure as Code#
IaC is not valuable because it creates resources. It is valuable because it makes infrastructure reviewable, repeatable, and recoverable.
Tools:
- Terraform
- OpenTofu
- Terragrunt, only when complexity justifies it
- AWS CloudFormation, enough to understand AWS-native patterns
- Checkov / tfsec / Trivy for IaC scanning
- Infracost for cost visibility
Skills:
- module design
- remote state
- state locking
- environment composition
- drift detection
- plan review
- policy checks
- import and refactor strategy
- safe destroy prevention
Projects to build:
- Multi-account AWS baseline with Terraform.
- VPC + EKS + RDS platform module.
- Terraform plan review pipeline with security and cost checks.
- Drift detection workflow that opens an issue when drift appears.
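A drift check like the one in the last project can lean on `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The sketch below is a minimal illustration in Python, not a finished workflow: the function names and the `workdir` argument are assumptions, and opening the issue is left to whatever tracker the team uses.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"   # plan succeeded and found no changes
    if code == 2:
        return "drift"     # plan succeeded and changes are present
    return "error"         # the plan itself failed


def check_drift(workdir: str) -> str:
    """Run a plan in workdir (a hypothetical path) and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Exit code 2 also fires when the code contains changes that have not been applied yet, so a scheduled check like this is most meaningful when it runs against the last applied commit.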
What "good" looks like:
- Terraform modules are small and understandable.
- Plans are reviewed before apply.
- Destructive changes are guarded.
- State is remote, locked, and backed up.
- Environments share patterns without copy-paste chaos.
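One way to guard destructive changes is to inspect the machine-readable plan before apply. The sketch below assumes the plan was exported with `terraform show -json` and relies on the `resource_changes[].change.actions` field of that format; the resource addresses in the sample are hypothetical, and the approval step itself is left out.

```python
import json


def destructive_changes(plan: dict) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as a delete combined with a create.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hypothetical excerpt shaped like `terraform show -json plan.out` output.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_security_group.old", "change": {"actions": ["delete"]}}
  ]
}
""")

flagged = destructive_changes(sample_plan)
if flagged:
    print("Needs explicit approval:", ", ".join(flagged))
```

In a pipeline, a non-empty result could fail the job or require a second reviewer, so that replacing a shared database never happens as a side effect of an unrelated change.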
2.3 Kubernetes and cloud-native operations#
Kubernetes is not maturity by itself. It is only useful if workloads are easier to deploy, observe, scale, and recover.
Tools:
- Kubernetes
- Helm
- Kustomize
- Argo CD
- ExternalDNS
- cert-manager
- AWS Load Balancer Controller
- Karpenter
- Metrics Server
- Prometheus
- Grafana
- OpenTelemetry
- Kyverno or OPA Gatekeeper
- Sealed Secrets or External Secrets Operator
- Velero
Skills:
- pod lifecycle
- deployments and rollouts
- services and ingress
- resource requests and limits
- probes
- autoscaling
- node provisioning
- workload identity
- network policy
- policy enforcement
- secret management
- backup and disaster recovery
- GitOps operations
Projects to build:
- EKS cluster with Terraform and Karpenter.
- GitOps deployment workflow with Argo CD.
- Kubernetes workload baseline chart with probes, resources, PDBs, HPA, and NetworkPolicy.
- External Secrets integration with AWS Secrets Manager.
- Velero backup and restore demo.
What "good" looks like:
- Workloads declare health clearly.
- Deployments can roll forward and roll back.
- Secrets are not stored in plain Git.
- Policies prevent unsafe defaults.
- Teams can deploy without needing cluster-admin access.
2.4 Secure delivery systems#
The pipeline should not just deploy. It should also protect the organization from bad changes.
Tools:
- GitHub Actions
- GitLab CI
- Jenkins, enough to support legacy teams
- Argo CD
- Cosign
- Syft
- Grype
- Trivy
- SLSA concepts
- OIDC federation
- Dependabot / Renovate
- SonarQube
Skills:
- pipeline design
- artifact versioning
- environment promotion
- OIDC-based cloud authentication
- secret-free CI
- container scanning
- dependency scanning
- SBOM generation
- signed images
- deployment approvals
- rollback automation
Projects to build:
- GitHub Actions pipeline that builds, scans, signs, and deploys a container.
- OIDC authentication from GitHub Actions to AWS.
- Progressive promotion from dev to staging to production.
- Automated rollback on failed smoke test.
What "good" looks like:
- CI does not use long-lived cloud keys.
- Every artifact is traceable to a commit.
- Vulnerability checks happen before deploy.
- Failed releases stop early.
- Rollback is documented and tested.
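The "automated rollback on failed smoke test" idea can be reduced to a small control flow that is easy to test on its own. This is a minimal sketch under stated assumptions: `deploy` and `smoke_test` are hypothetical callables supplied by the pipeline, and redeploying the previous version stands in for whatever rollback mechanism the platform actually uses.

```python
from typing import Callable


def release(
    deploy: Callable[[str], None],
    smoke_test: Callable[[], bool],
    new_version: str,
    previous_version: str,
) -> str:
    """Deploy new_version and return the version that ends up live."""
    deploy(new_version)
    try:
        healthy = smoke_test()
    except Exception:
        healthy = False  # a crashing smoke test counts as a failed release
    if healthy:
        return new_version
    deploy(previous_version)  # roll back by redeploying the last good version
    return previous_version


# Usage with in-memory fakes (hypothetical versions "v1" and "v2").
history: list[str] = []
live = release(history.append, lambda: False, "v2", "v1")
print(live, history)
```

Keeping the decision logic separate from the deployment tooling means the rollback path can be exercised in a unit test long before a real release needs it.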
2.5 Observability and production signal#
Dashboards are not enough. You need signals that explain user impact and system behavior.
Tools:
- OpenTelemetry
- Prometheus
- Grafana
- Loki
- Tempo
- Alertmanager
- CloudWatch
- Sentry
- Datadog or New Relic, enough to understand commercial observability platforms
Signals to master:
- latency
- traffic
- errors
- saturation
- availability
- request traces
- dependency latency
- queue depth
- database connection pressure
- retry storms
- cost anomalies
Skills:
- RED metrics
- USE metrics
- SLOs and error budgets
- alert routing
- alert severity design
- dashboard design
- distributed tracing
- log correlation
- practical runbooks
Projects to build:
- OpenTelemetry-instrumented Node.js service.
- Prometheus + Grafana dashboard for API latency, errors, and saturation.
- SLO dashboard with burn-rate alerts.
- Trace-based debugging demo across multiple services.
What "good" looks like:
- Alerts map to user impact or real operational risk.
- Dashboards answer specific questions.
- Logs, metrics, and traces connect to the same incident.
- On-call engineers know what action to take.
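Burn-rate alerting ties an alert to how fast the error budget is being spent rather than to a raw error count. The sketch below assumes the common definition (observed error ratio divided by the allowed error ratio) and a two-window check. The 14.4 threshold is the value often quoted for a one-hour window against a 30-day SLO and is used here as an assumption, not a recommendation for every service.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for 99.9%
    return error_ratio / budget


def should_page(
    long_window_ratio: float,
    short_window_ratio: float,
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both the long and the short window burn above the threshold.

    The short window keeps the alert from firing after the problem has
    already stopped.
    """
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )


# 2% errors in both windows against a 99.9% SLO is roughly a 20x burn.
print(should_page(0.02, 0.02, 0.999))
# The short window has recovered, so the same long-window value does not page.
print(should_page(0.02, 0.0005, 0.999))
```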
2.6 Reliability engineering and incident response#
This is where you become more than a deployment engineer.
Skills:
- incident command
- triage
- rollback decisions
- postmortems
- failure mode analysis
- dependency mapping
- load testing
- chaos testing
- capacity planning
- disaster recovery
Tools:
- Grafana OnCall / PagerDuty / Opsgenie
- Incident.io
- Statuspage
- k6
- Locust
- LitmusChaos
- AWS Fault Injection Service
Projects to build:
- Incident response playbook for a Kubernetes API outage.
- Postmortem for a failed deployment.
- Load test that reveals saturation limits.
- DR drill restoring database backup into a fresh environment.
What "good" looks like:
- Incidents have roles.
- Rollback criteria are known before deployment.
- Postmortems focus on learning, not blame.
- Recovery steps are tested, not assumed.
2.7 Platform engineering and developer experience#
Platform engineering is about making the right path the easy path.
Tools:
- Backstage
- Score
- Humanitec concepts
- Crossplane
- Kubernetes operators
- Argo CD
- Terraform modules
- Internal templates
Skills:
- golden paths
- self-service environments
- service catalog design
- platform APIs
- developer onboarding
- paved-road templates
- platform success metrics
Projects to build:
- Backstage developer portal with service catalog.
- Self-service app template that creates repo, pipeline, Helm chart, and observability defaults.
- Platform API for provisioning a database or namespace.
- Developer onboarding workflow measured from "new repo" to "running in staging."
What "good" looks like:
- Developers can ship without learning every infrastructure detail.
- Platform defaults include security, observability, and rollback behavior.
- The platform reduces cognitive load.
- Teams are not blocked by ticket queues for routine work.
2.8 Security and governance#
Security must be built into the path, not added at the end.
Tools:
- IAM Access Analyzer
- AWS Config
- GuardDuty
- Security Hub
- KMS
- Secrets Manager
- Kyverno
- OPA Gatekeeper
- Trivy
- Checkov
- Cosign
- Falco
Skills:
- least privilege
- workload identity
- secret rotation
- network segmentation
- image scanning
- policy as code
- audit trails
- secure CI/CD
- incident response for security events
Projects to build:
- Kubernetes policy pack that blocks privileged pods and missing resource limits.
- CI pipeline with image scanning and SBOM output.
- AWS IAM review workflow for excessive permissions.
- Secret rotation demo with External Secrets Operator.
What "good" looks like:
- Unsafe workloads are blocked by default.
- Secrets are not copied manually.
- Cloud access is short-lived and auditable.
- Security checks happen before production.
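The policy pack that blocks privileged pods and missing resource limits can be prototyped as a plain function over a Pod manifest before it is written as Kyverno or Gatekeeper policy. The sketch below is a CI-side illustration only, assuming the manifest has already been parsed into a dictionary. The field names follow the Kubernetes Pod spec, while the sample pod and the message wording are hypothetical.

```python
def policy_violations(pod: dict) -> list[str]:
    """Return human-readable violations for a parsed Pod manifest."""
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            violations.append(f"{name}: privileged container")
        limits = (container.get("resources") or {}).get("limits") or {}
        for resource in ("cpu", "memory"):
            if resource not in limits:
                violations.append(f"{name}: missing {resource} limit")
    return violations


# Hypothetical manifest with one compliant and one non-compliant container.
sample_pod = {
    "spec": {
        "containers": [
            {
                "name": "api",
                "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
            },
            {
                "name": "debug",
                "securityContext": {"privileged": True},
                "resources": {"limits": {"memory": "128Mi"}},
            },
        ]
    }
}

for violation in policy_violations(sample_pod):
    print(violation)
```

A check like this only gives early feedback in CI. Enforcement still belongs at admission time in the cluster, where it cannot be bypassed by skipping the pipeline.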
2.9 Cost and efficiency#
Production systems must hold up financially too.
Tools:
- AWS Cost Explorer
- AWS Budgets
- Compute Optimizer
- Kubecost
- Infracost
- Karpenter
- HPA / VPA
Skills:
- right-sizing
- autoscaling
- cost allocation tags
- idle resource detection
- reserved capacity basics
- spot capacity trade-offs
- cost-aware architecture
Projects to build:
- Kubernetes cost dashboard with Kubecost.
- Terraform pull request cost estimate with Infracost.
- Karpenter node consolidation demo.
- AWS budget alerting and cost anomaly notification workflow.
What "good" looks like:
- Teams know what services cost.
- Waste is visible.
- Autoscaling balances reliability and cost.
- Cost controls do not surprise production workloads.
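Right-sizing usually starts with simple arithmetic: compare what a workload requests with what it actually uses, and keep some headroom. The sketch below assumes CPU is measured in millicores and that a p95 usage figure is already available from monitoring. The 30% headroom default and the function name are illustrative assumptions, not a sizing rule.

```python
def cpu_rightsizing(
    requested_mcpu: int, p95_usage_mcpu: int, headroom_pct: int = 30
) -> dict:
    """Suggest a CPU request of p95 usage plus headroom, in millicores."""
    # Integer ceiling of p95 * (100 + headroom_pct) / 100, avoiding float rounding.
    target = -(-p95_usage_mcpu * (100 + headroom_pct) // 100)
    return {
        "suggested_request_mcpu": target,
        # Capacity that could be handed back; zero if the workload is under-requested.
        "reclaimable_mcpu": max(0, requested_mcpu - target),
    }


# A workload requesting 1000m while using about 200m at p95.
print(cpu_rightsizing(1000, 200))
# A workload requesting less than it uses has nothing to reclaim.
print(cpu_rightsizing(250, 400))
```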
3. Technology Stack to Prioritize#
Tier 1: Must-have foundation#
- Linux
- Networking
- Git
- Docker
- AWS
- Terraform
- Kubernetes
- GitHub Actions
- Bash
- TypeScript / Node.js
- SQL basics
Tier 2: Production platform stack#
- EKS
- Helm
- Argo CD
- Karpenter
- External Secrets Operator
- AWS Load Balancer Controller
- Prometheus
- Grafana
- OpenTelemetry
- Loki / Tempo
- Kyverno or OPA Gatekeeper
- Trivy
- Checkov
- Infracost
Tier 3: Differentiators#
- Backstage
- Crossplane
- Kubernetes operators
- Cosign and software supply chain security
- SLO tooling
- Incident management tooling
- Chaos testing
- AI-assisted operations and runbook generation
- Platform APIs
4. Certifications to Target#
Do not collect certifications randomly. Use them as milestones for capability.
Cloud#
AWS Certified Solutions Architect - Associate#
Why:
- Validates cloud architecture fundamentals.
- Helps with networking, IAM, compute, storage, and resilience decisions.
AWS Certified DevOps Engineer - Professional#
Why:
- Directly aligns with delivery automation, monitoring/logging, resilient cloud solutions, incident/event response, security, and compliance.
- AWS says this exam validates expertise in provisioning, operating, and managing distributed systems and services on AWS.
Google Professional Cloud DevOps Engineer#
Why:
- Strong SRE alignment.
- Google describes the role as balancing reliability with delivery speed while optimizing production systems for performance and cost.
Kubernetes and cloud native#
CKA: Certified Kubernetes Administrator#
Why:
- Proves hands-on Kubernetes operations ability.
- Useful for cluster administration, troubleshooting, and workload operations.
CKS: Certified Kubernetes Security Specialist#
Why:
- Best next step after CKA if you want to own production security.
- Covers workload, cluster, supply chain, and runtime security.
PCA: Prometheus Certified Associate#
Why:
- Helps prove monitoring and alerting fundamentals.
OTCA: OpenTelemetry Certified Associate#
Why:
- Good fit for modern observability and telemetry pipelines.
CAPA: Certified Argo Project Associate#
Why:
- Useful if you want GitOps to be a visible part of your platform story.
CNPA / CNPE: Cloud Native Platform Engineering#
Why:
- CNPA validates platform engineering fundamentals like automation, security, observability, continuous delivery, platform APIs, and developer experience.
- CNPE is performance-based and targets advanced platform engineering: GitOps/CD, self-service capabilities, observability/operations, security, and policy enforcement.
Infrastructure as Code#
Terraform Associate#
Why:
- Validates Terraform fundamentals.
Terraform Authoring and Operations Professional#
Why:
- Better long-term signal than Associate if you want to show advanced Terraform design and operations ability.
5. Resources to Study#
Reliability and operating production systems#
- Google SRE books and resources
- AWS Well-Architected Framework
- AWS Operational Excellence Pillar
- AWS Reliability Pillar
- Google Cloud SRE guidance
- DORA / State of DevOps research
What to extract:
- SLOs
- error budgets
- alert quality
- incident response
- postmortem culture
- reducing toil
- release engineering
- capacity planning
Cloud and architecture#
- AWS Well-Architected Labs
- AWS Architecture Center
- AWS Builders' Library
- Google Cloud Architecture Center
- Azure Architecture Center, even if AWS is primary
What to extract:
- trade-off thinking
- blast radius reduction
- multi-AZ design
- backup and restore patterns
- identity boundaries
- operational readiness
Kubernetes and platform engineering#
- Kubernetes official docs
- CNCF landscape
- CNCF platform engineering certifications and curricula
- Argo CD docs
- Helm docs
- Karpenter docs
- External Secrets Operator docs
- Backstage docs
What to extract:
- workload health
- GitOps operation
- cluster security
- platform APIs
- developer self-service
- service catalogs
Observability#
- OpenTelemetry docs
- Prometheus docs
- Grafana docs
- Google SRE chapters on monitoring and alerting
What to extract:
- metrics, logs, traces
- telemetry pipelines
- alert design
- SLO dashboards
- burn-rate alerts
- correlation during incidents
Security#
- AWS security best practices
- Kubernetes security docs
- OWASP Top 10
- OWASP Kubernetes Top 10
- SLSA supply chain security
- Sigstore / Cosign docs
What to extract:
- least privilege
- secure CI/CD
- secrets handling
- image provenance
- runtime detection
- policy as code
Books#
- Site Reliability Engineering by Google
- The Site Reliability Workbook by Google
- Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim
- The Phoenix Project
- The Unicorn Project
- Designing Data-Intensive Applications
- Release It!
- Cloud Native DevOps with Kubernetes
- Kubernetes Patterns
- Infrastructure as Code by Kief Morris
6. Portfolio Projects to Build#
These projects should prove you can help teams, not just deploy tools.
Project 1: Production EKS Platform Baseline#
Goal:
Build a secure AWS EKS platform that a team could realistically use.
Include:
- Terraform-managed VPC
- EKS cluster
- managed node groups or Karpenter
- IAM roles for service accounts
- External Secrets Operator
- AWS Load Balancer Controller
- cert-manager
- Argo CD
- Prometheus and Grafana
- basic NetworkPolicy
- example app deployment
Show:
- architecture diagram
- Terraform module structure
- deployment flow
- security decisions
- failure modes
- cost notes
- runbook
Project 2: Secure CI/CD Delivery System#
Goal:
Build a pipeline that protects production.
Include:
- GitHub Actions
- Docker build
- unit tests
- linting
- dependency scan
- image scan
- SBOM generation
- image signing with Cosign
- OIDC authentication to AWS
- deploy to staging
- smoke tests
- manual approval for production
- rollback workflow
Show:
- why each gate exists
- what happens on failure
- how secrets are avoided
- how artifacts are traced
Project 3: Observability and SLO Platform#
Goal:
Prove you can detect user-impacting problems before support tickets arrive.
Include:
- OpenTelemetry instrumentation
- metrics, logs, traces
- Prometheus
- Grafana
- Loki
- Tempo
- SLO dashboard
- burn-rate alerts
- runbooks attached to alerts
Show:
- good alert vs bad alert
- dashboard screenshots
- example incident walkthrough
- trace showing a slow dependency
Project 4: Failure and Recovery Lab#
Goal:
Demonstrate recovery, not just deployment.
Include:
- Kubernetes app
- database
- backup workflow
- restore workflow
- simulated failed deploy
- simulated dependency latency
- load test
- incident timeline
- postmortem
Show:
- recovery time
- what signal detected failure
- what action recovered the system
- what guardrail you added afterward
Project 5: Internal Developer Platform Starter#
Goal:
Show platform engineering and developer experience.
Include:
- Backstage
- service catalog
- golden path template
- app scaffold
- default CI pipeline
- default Helm chart
- default observability dashboard
- default runbook
Show:
- how a developer creates a new service
- what defaults are included
- how the platform reduces cognitive load
- how you measure platform success
Project 6: Policy-as-Code Guardrails#
Goal:
Show governance without blocking delivery unnecessarily.
Include:
- Kyverno or OPA Gatekeeper
- policies for privileged pods
- policies for resource requests/limits
- policies for required labels
- policies for image registry rules
- CI policy checks
Show:
- unsafe deployment blocked
- safe deployment allowed
- exception process
- policy documentation
Project 7: Cost-Aware Kubernetes Platform#
Goal:
Show that production reliability includes cost control.
Include:
- Karpenter
- Kubecost
- Infracost
- AWS Budgets
- right-sizing recommendations
- idle workload detection
Show:
- before/after cost estimate
- node consolidation
- team-level cost visibility
- trade-offs between availability and cost
7. Case Studies to Write#
Each case study should follow this structure:
- Context
- Problem
- Constraints
- Design options
- Chosen architecture
- Implementation
- Failure modes
- Operational guardrails
- Results
- Lessons learned
Case Study 1: Building a Production-Ready EKS Baseline#
Angle:
"Kubernetes is not maturity unless the platform is secure, observable, and recoverable."
Cover:
- networking
- IAM
- GitOps
- secrets
- ingress
- workload identity
- observability
- backup strategy
Case Study 2: From Green Pipeline to Safe Delivery#
Angle:
"A green pipeline only means the pipeline passed. It does not mean production is safe."
Cover:
- scanning
- signing
- promotion
- smoke tests
- rollback
- artifact traceability
Case Study 3: Reducing Alert Noise With Better Signals#
Angle:
"The goal is not more alerts. The goal is actionable signal."
Cover:
- alert audit
- SLOs
- burn-rate alerts
- runbooks
- routing
- severity levels
Case Study 4: Designing Terraform Modules That Do Not Surprise Production#
Angle:
"Terraform does exactly what you ask. The hard part is making sure you asked safely."
Cover:
- module boundaries
- plan review
- drift detection
- state safety
- destructive change guardrails
Case Study 5: Recovering From a Bad Deployment#
Angle:
"Rollback is not a button. It is an operational design."
Cover:
- deployment strategy
- health checks
- smoke tests
- rollback criteria
- incident timeline
- postmortem
Case Study 6: Building a Developer Platform Golden Path#
Angle:
"The platform should make the secure, observable path the easiest path."
Cover:
- service templates
- CI defaults
- observability defaults
- documentation
- ownership metadata
- developer feedback
Case Study 7: Secrets Management Without Copy-Paste Risk#
Angle:
"Secret handling is an operations problem, not just a security checkbox."
Cover:
- External Secrets Operator
- AWS Secrets Manager
- IAM roles
- rotation
- auditability
8. Blog Topics to Write#
These should position you as a practical platform reliability engineer.
Foundational posts#
- Kubernetes is not maturity: what teams still get wrong after adopting it
- Green pipelines are not safe deployments
- Terraform did exactly what we asked: why plan review matters
- What makes infrastructure "reviewable"?
- Why production readiness starts before launch
Reliability posts#
- SLOs explained without enterprise jargon
- Alert fatigue: how to design alerts people trust
- How to write a useful runbook
- What a good postmortem actually includes
- The difference between uptime and user trust
Platform engineering posts#
- What is a golden path?
- Internal developer platforms without buzzwords
- Backstage: what problem does it actually solve?
- Platform engineering is product work
- Measuring platform success beyond deployment count
Cloud and Kubernetes posts#
- EKS baseline: the pieces teams forget
- IAM Roles for Service Accounts explained
- Karpenter vs Cluster Autoscaler
- External Secrets Operator in real workflows
- Kubernetes probes: how bad health checks cause outages
CI/CD and supply chain posts#
- OIDC in CI/CD: why long-lived AWS keys should disappear
- Container image signing with Cosign
- SBOMs explained for working engineers
- Progressive delivery without overcomplication
- Rollback strategy for small teams
AI and automation posts#
- AI can write Terraform, but can it own the blast radius?
- How I use AI to review infrastructure changes
- The new DevOps skill: asking better operational questions
- Automating toil without hiding risk
- Why human judgment still matters in platform engineering
9. What to Measure#
Use metrics to learn, not to decorate dashboards.
Delivery metrics#
- deployment frequency
- lead time for changes
- change failure rate
- mean time to recovery
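These four metrics can be computed from a plain list of deployment records, which makes them easy to start tracking before any dedicated tooling exists. The sketch below is a minimal illustration: the `Deployment` record, its field names, and the sample timestamps are assumptions, and recovery time is measured from the failed deploy to the recorded restore time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def delivery_metrics(deploys: list[Deployment], period_days: int) -> dict:
    """Compute the four delivery metrics over a reporting period."""
    if not deploys:
        raise ValueError("no deployments in the period")
    failures = [d for d in deploys if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


# Hypothetical week with four deployments, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
deploys = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0 + day, t0 + day + 4 * hour, failed=True,
               restored_at=t0 + day + 4 * hour + timedelta(minutes=30)),
    Deployment(t0 + 2 * day, t0 + 2 * day + 3 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 3 * hour),
]
metrics = delivery_metrics(deploys, period_days=7)
print(metrics)
```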
Reliability metrics#
- availability
- latency percentiles
- error rate
- saturation
- SLO compliance
- error budget burn
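SLO compliance and error budget burn come down to simple arithmetic once a target and a window are chosen. The sketch below assumes a time-based availability SLO. The 99.9% target, the 30-day window, and the function names are illustrative assumptions.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of unavailability the SLO allows over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(
    slo_target: float, window_days: int, downtime_minutes: float
) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% target over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999, 30), 1))
# After 10.8 minutes of downtime, about three quarters of the budget is left.
print(round(budget_remaining(0.999, 30, 10.8), 2))
```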
Platform metrics#
- time to create a new service
- time to first deploy
- percentage of services using golden path
- reduction in manual platform tickets
- developer satisfaction
- onboarding time
Security metrics#
- unresolved critical vulnerabilities
- workloads missing resource limits
- workloads using privileged permissions
- secrets nearing expiration
- CI jobs using long-lived credentials
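A metric such as "secrets nearing expiration" only helps if something reports it before the expiry date. The sketch below assumes an inventory that maps secret names to expiry dates has already been collected from the secret store. The names, the dates, and the 30-day warning window are hypothetical.

```python
from datetime import date, timedelta


def expiring_soon(
    expiry_by_secret: dict[str, date], today: date, warn_days: int = 30
) -> list[str]:
    """Return secrets that expire within warn_days, including already expired ones."""
    cutoff = today + timedelta(days=warn_days)
    return sorted(
        name for name, expires in expiry_by_secret.items() if expires <= cutoff
    )


# Hypothetical inventory of secrets and their expiry dates.
inventory = {
    "payments/api-key": date(2024, 6, 10),
    "db/readonly-password": date(2024, 12, 1),
    "tls/internal-cert": date(2024, 5, 20),
}
print(expiring_soon(inventory, today=date(2024, 6, 1)))
```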
Cost metrics#
- idle compute
- namespace/team cost
- over-provisioned workloads
- monthly cost trend
- cost per environment
10. How to Talk About Yourself#
Avoid:
- "I am a DevOps engineer who knows many tools."
- "I deploy Kubernetes applications."
- "I write Terraform and CI/CD."
Use:
- "I build cloud platforms that make production safer to change."
- "I design delivery systems with security, observability, and rollback built in."
- "I help teams ship confidently by turning operational knowledge into guardrails."
- "I use automation and AI to reduce toil, but keep human judgment around blast radius, security, and recovery."
- "I care about the gap between green dashboards and real user experience."
11. Portfolio Positioning Checklist#
Your portfolio should prove:
- You understand systems beyond tools.
- You can design cloud foundations.
- You can secure delivery.
- You can observe production meaningfully.
- You can recover from failure.
- You can write clear runbooks and postmortems.
- You can reduce developer friction.
- You can explain trade-offs.
- You can use automation without losing ownership.
For every project, include:
- architecture diagram
- problem statement
- constraints
- tools used
- why those tools
- security decisions
- failure modes
- observability design
- rollback/recovery plan
- lessons learned
12. Recommended Source Links#
Official certification and cloud-native sources#
- AWS Certified DevOps Engineer - Professional: https://docs.aws.amazon.com/aws-certification/latest/userguide/devops-engineer-professional-02.html
- CNCF Cloud Native Certifications: https://www.cncf.io/training/certification/
- CNCF Certified Cloud Native Platform Engineering Associate: https://www.cncf.io/training/certification/cnpa/
- CNCF Certified Cloud Native Platform Engineer: https://www.cncf.io/training/certification/cnpe/
- HashiCorp Certifications: https://www.hashicorp.com/certification
- Google Professional Cloud DevOps Engineer: https://cloud.google.com/learn/certification/cloud-devops-engineer
Reliability, observability, and operations#
- AWS Well-Architected Framework: https://aws.amazon.com/architecture/well-architected/
- AWS Operational Excellence Pillar: https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html
- Google SRE resources: https://sre.google/resources/
- Google Cloud SRE overview: https://cloud.google.com/sre
- OpenTelemetry documentation: https://opentelemetry.io/docs/
- DORA metrics overview: https://www.atlassian.com/devops/frameworks/dora-metrics
