The Big Picture Direction: Cloud Platforms That Hold Up in Production

This document sets a big-picture direction for becoming the engineer described by this value proposition:

I build secure delivery systems, reliable infrastructure, and operational guardrails for teams that need more than YAML, dashboards, and green pipelines. The goal is simple: ship confidently, recover quickly, and own failure before users feel it.

The goal is not to become a tool collector. The goal is to become the person teams trust when production matters.

North Star#

Become the engineer who can answer these questions clearly:

Can we deploy this safely?
If it fails, how will we know?
If users are affected, what signal fires first?
Can we roll back?
Can we recover data?
Who owns this service?
What is the blast radius?
What guardrail prevents this mistake next time?
What should be automated?
What should still require human review?

That is how you become the person teams trust when the platform has to hold up in production.

1. The Mindset#

1.1 Think in outcomes, not tools#

Tools are only useful when they improve an operational outcome.

Do not say:

"I know Kubernetes."
"I know Terraform."
"I know GitHub Actions."

Say:

"I can make deployments safer."
"I can make infrastructure reviewable and recoverable."
"I can reduce mean time to recovery."
"I can design guardrails that stop common failure modes."
"I can help teams understand whether production is healthy."

1.2 Own failure end-to-end#

Production reliability is not only about preventing failure. It is also about preparing for it.

You should constantly ask:

What happens if this dependency slows down?
What happens if this deployment partially succeeds?
What happens if this Terraform change replaces a shared resource?
What happens if a secret expires?
What happens if a region or availability zone fails?
What signal tells us users are suffering before support tickets arrive?

1.3 Automate, but keep judgment in the loop#

AI and automation can write YAML, Terraform, scripts, and runbooks faster than humans.

Your value is knowing:

what should be automated
what should require review
what blast radius is acceptable
what signals prove the change worked
what rollback path exists
what security boundary must not be crossed

1.4 Prefer boring systems#

The best platform work often feels boring:

predictable deployments
boring rollbacks
boring alerts
boring infrastructure reviews
boring incident response
boring onboarding

Boring means the system is understandable under pressure.

2. Core Capability Map#

2.1 Cloud foundations#

You need enough cloud architecture depth to build environments that teams can safely use.

Focus areas:

AWS account structure and organization strategy
IAM roles, permission boundaries, and least privilege
VPC design, subnets, routing, NAT, security groups, NACLs
Load balancing and DNS
Secrets management
Compute choices: EC2, ECS, EKS, Lambda
Storage choices: S3, EBS, EFS
Database operations: RDS, backups, failover, read replicas
Multi-environment design: dev, staging, production

Tools and services to know:

AWS Organizations
IAM
VPC
Route 53
ACM
ALB / NLB
EKS
ECS
RDS
S3
Secrets Manager
Systems Manager Parameter Store
CloudWatch
CloudTrail
AWS Config
AWS Backup

What "good" looks like:

Environments are consistent.
IAM is scoped and reviewable.
Networking is understandable.
Production has backups, monitoring, and recovery paths.
Changes are made through code, not console-clicking.
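
One way to make "IAM is scoped and reviewable" concrete is a small check that flags broad grants before a human reads the policy. The sketch below is a minimal example, assuming policies are available as parsed JSON in the standard IAM policy document shape. The function name and the example policy are illustrative and do not come from any AWS tool.

```python
from typing import Any


def find_wildcard_statements(policy: dict[str, Any]) -> list[dict[str, Any]]:
    """Return Allow statements whose Action or Resource is a bare "*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM also accepts a single statement object
        statements = [statements]

    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Both fields may be a single string or a list of strings.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if "*" in actions or "*" in resources:
            flagged.append(stmt)
    return flagged


# Hypothetical policy used only for illustration.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}

for stmt in find_wildcard_statements(example_policy):
    print("Review needed:", stmt)
```

A check like this only catches the most obvious over-grants. Service-level wildcards such as "ec2:*" still need human judgment or a tool like IAM Access Analyzer.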

2.2 Infrastructure as Code#

IaC is not valuable because it creates resources. It is valuable because it makes infrastructure reviewable, repeatable, and recoverable.

Tools:

Terraform
OpenTofu
Terragrunt, only when complexity justifies it
AWS CloudFormation, enough to understand AWS-native patterns
Checkov / Trivy for IaC scanning (tfsec's checks have been folded into Trivy)
Infracost for cost visibility

Skills:

module design
remote state
state locking
environment composition
drift detection
plan review
policy checks
import and refactor strategy
safe destroy prevention

Projects to build:

Multi-account AWS baseline with Terraform.
VPC + EKS + RDS platform module.
Terraform plan review pipeline with security and cost checks.
Drift detection workflow that opens an issue when drift appears.

What "good" looks like:

Terraform modules are small and understandable.
Plans are reviewed before apply.
Destructive changes are guarded.
State is remote, locked, and backed up.
Environments share patterns without copy-paste chaos.
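
"Destructive changes are guarded" can start with something as simple as reading the plan before apply. The sketch below assumes the plan has been exported with `terraform show -json` and reads the `resource_changes` entries and their `change.actions` lists from that output. The sample plan is made up, and the function name is illustrative.

```python
import json


def destructive_changes(plan: dict) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement appears as both "delete" and "create" in the actions list.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Trimmed-down, hypothetical stand-in for `terraform show -json plan.out`.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_role.ci", "change": {"actions": ["no-op"]}}
  ]
}
""")

flagged = destructive_changes(sample_plan)
if flagged:
    print("Needs explicit approval:", ", ".join(flagged))
```

In a pipeline, a non-empty result would typically fail the job or require a second reviewer rather than just print a message.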

2.3 Kubernetes and cloud-native operations#

Running Kubernetes is not a sign of maturity by itself. It is only useful if it makes workloads easier to deploy, observe, scale, and recover.

Tools:

Kubernetes
Helm
Kustomize
Argo CD
ExternalDNS
cert-manager
AWS Load Balancer Controller
Karpenter
Metrics Server
Prometheus
Grafana
OpenTelemetry
Kyverno or OPA Gatekeeper
Sealed Secrets or External Secrets Operator
Velero

Skills:

pod lifecycle
deployments and rollouts
services and ingress
resource requests and limits
probes
autoscaling
node provisioning
workload identity
network policy
policy enforcement
secret management
backup and disaster recovery
GitOps operations

Projects to build:

EKS cluster with Terraform and Karpenter.
GitOps deployment workflow with Argo CD.
Kubernetes workload baseline chart with probes, resources, PDBs, HPA, and NetworkPolicy.
External Secrets integration with AWS Secrets Manager.
Velero backup and restore demo.

What "good" looks like:

Workloads declare health clearly.
Deployments can roll forward and roll back.
Secrets are not stored in plain Git.
Policies prevent unsafe defaults.
Teams can deploy without needing cluster-admin access.
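
To illustrate what "workloads declare health clearly" and "policies prevent unsafe defaults" might look like before a manifest ever reaches the cluster, here is a minimal pre-merge check. It assumes the Deployment manifest has already been parsed from YAML into a dict. The specific rules are one possible baseline, not a standard, and in-cluster enforcement would still belong to Kyverno or OPA Gatekeeper.

```python
def baseline_violations(deployment: dict) -> list[str]:
    """List baseline gaps for each container in a Deployment manifest."""
    violations = []
    containers = deployment["spec"]["template"]["spec"].get("containers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        if "readinessProbe" not in container:
            violations.append(f"{name}: no readinessProbe")
        if not resources.get("requests"):
            violations.append(f"{name}: no resource requests")
        if not resources.get("limits"):
            violations.append(f"{name}: no resource limits")
        if container.get("securityContext", {}).get("privileged") is True:
            violations.append(f"{name}: runs privileged")
    return violations


# Hypothetical manifest used only for illustration.
example_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "api"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "api",
                        "image": "registry.example.com/api:1.4.2",
                        "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
                        "securityContext": {"privileged": True},
                    }
                ]
            }
        }
    },
}

for violation in baseline_violations(example_deployment):
    print(violation)
```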

2.4 Secure delivery systems#

The pipeline should not just deploy. It should protect the organization from bad changes.

Tools:

GitHub Actions
GitLab CI
Jenkins, enough to support legacy teams
Argo CD
Cosign
Syft
Grype
Trivy
SLSA concepts
OIDC federation
Dependabot / Renovate
SonarQube

Skills:

pipeline design
artifact versioning
environment promotion
OIDC-based cloud authentication
secret-free CI
container scanning
dependency scanning
SBOM generation
signed images
deployment approvals
rollback automation

Projects to build:

GitHub Actions pipeline that builds, scans, signs, and deploys a container.
OIDC authentication from GitHub Actions to AWS.
Progressive promotion from dev to staging to production.
Automated rollback on failed smoke test.

What "good" looks like:

CI does not use long-lived cloud keys.
Every artifact is traceable to a commit.
Vulnerability checks happen before deploy.
Failed releases stop early.
Rollback is documented and tested.
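
"CI does not use long-lived cloud keys" usually means the pipeline exchanges a short-lived OIDC token for temporary credentials. The sketch below builds the IAM role trust policy that allows this for GitHub Actions. It assumes the GitHub OIDC provider is already registered in the AWS account. The account ID, repository, and branch are placeholders, and the helper name is hypothetical.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, branch: str) -> dict:
    """Trust policy letting one repo and branch assume a role through OIDC."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience used by the official AWS credentials action.
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        # Pin the role to a single repo and branch.
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Placeholder values; replace with a real account ID and repository.
policy = github_oidc_trust_policy("123456789012", "example-org/example-repo", "main")
print(json.dumps(policy, indent=2))
```

Pinning the `sub` claim to one repository and branch is the design choice that matters most here. A trust policy that only checks the audience would let any GitHub repository assume the role.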

2.5 Observability and production signal#

Dashboards are not enough. You need signals that explain user impact and system behavior.

Tools:

OpenTelemetry
Prometheus
Grafana
Loki
Tempo
Alertmanager
CloudWatch
Sentry
Datadog or New Relic, enough to understand commercial observability platforms

Signals to master:

latency
traffic
errors
saturation
availability
request traces
dependency latency
queue depth
database connection pressure
retry storms
cost anomalies

Skills:

RED metrics
USE metrics
SLOs and error budgets
alert routing
alert severity design
dashboard design
distributed tracing
log correlation
practical runbooks

Projects to build:

OpenTelemetry-instrumented Node.js service.
Prometheus + Grafana dashboard for API latency, errors, and saturation.
SLO dashboard with burn-rate alerts.
Trace-based debugging demo across multiple services.

What "good" looks like:

Alerts map to user impact or real operational risk.
Dashboards answer specific questions.
Logs, metrics, and traces connect to the same incident.
On-call engineers know what action to take.

2.6 Reliability engineering and incident response#

This is where you become more than a deployment engineer.

Skills:

incident command
triage
rollback decisions
postmortems
failure mode analysis
dependency mapping
load testing
chaos testing
capacity planning
disaster recovery

Tools:

Grafana OnCall / PagerDuty / Opsgenie
Incident.io
Statuspage
k6
Locust
LitmusChaos
AWS Fault Injection Service

Projects to build:

Incident response playbook for a Kubernetes API outage.
Postmortem for a failed deployment.
Load test that reveals saturation limits.
DR drill restoring database backup into a fresh environment.

What "good" looks like:

Incidents have roles.
Rollback criteria are known before deployment.
Postmortems focus on learning, not blame.
Recovery steps are tested, not assumed.
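
"Rollback criteria are known before deployment" can be made literal by writing the criteria down as data and evaluating them against smoke-test results. The sketch below is one minimal way to do that. The thresholds, field names, and class names are all hypothetical and would come from the service's own SLOs in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackCriteria:
    """Thresholds agreed on before the release starts."""
    max_error_rate: float
    max_p99_latency_ms: float


@dataclass(frozen=True)
class SmokeResult:
    """What the post-deploy smoke test observed."""
    error_rate: float
    p99_latency_ms: float


def rollback_reasons(criteria: RollbackCriteria, result: SmokeResult) -> list[str]:
    """Return the violated criteria; an empty list means keep the release."""
    reasons = []
    if result.error_rate > criteria.max_error_rate:
        reasons.append(
            f"error rate {result.error_rate:.2%} exceeds {criteria.max_error_rate:.2%}"
        )
    if result.p99_latency_ms > criteria.max_p99_latency_ms:
        reasons.append(
            f"p99 latency {result.p99_latency_ms:.0f} ms exceeds "
            f"{criteria.max_p99_latency_ms:.0f} ms"
        )
    return reasons


# Hypothetical thresholds and observations.
criteria = RollbackCriteria(max_error_rate=0.01, max_p99_latency_ms=800)
observed = SmokeResult(error_rate=0.03, p99_latency_ms=450)

reasons = rollback_reasons(criteria, observed)
if reasons:
    print("Roll back:", "; ".join(reasons))
```

Returning the reasons rather than a bare boolean gives the incident timeline and the postmortem something concrete to quote.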

2.7 Platform engineering and developer experience#

Platform engineering is about making the right path the easy path.

Tools:

Backstage
Score
Humanitec concepts
Crossplane
Kubernetes operators
Argo CD
Terraform modules
Internal templates

Skills:

golden paths
self-service environments
service catalog design
platform APIs
developer onboarding
paved-road templates
platform success metrics

Projects to build:

Backstage developer portal with service catalog.
Self-service app template that creates repo, pipeline, Helm chart, and observability defaults.
Platform API for provisioning a database or namespace.
Developer onboarding workflow measured from "new repo" to "running in staging."

What "good" looks like:

Developers can ship without learning every infrastructure detail.
Platform defaults include security, observability, and rollback behavior.
The platform reduces cognitive load.
Teams are not blocked by ticket queues for routine work.
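
A service catalog only answers "who owns this service?" if every entry actually declares an owner. The sketch below checks a Backstage-style Component entity for the fields a golden path template would be expected to fill in. It assumes the entity has been parsed from `catalog-info.yaml` into a dict. The example entity and the function name are illustrative.

```python
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def missing_catalog_fields(entity: dict) -> list[str]:
    """Return dotted paths of required fields that are absent or empty."""
    missing = []
    if not entity.get("metadata", {}).get("name"):
        missing.append("metadata.name")
    spec = entity.get("spec", {})
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            missing.append(f"spec.{field}")
    return missing


# Hypothetical catalog entry with no owner declared.
example_entity = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "payments-api"},
    "spec": {"type": "service", "lifecycle": "production"},
}

gaps = missing_catalog_fields(example_entity)
if gaps:
    print("Catalog entry incomplete:", ", ".join(gaps))
```

Running a check like this in the template's default CI pipeline keeps ownership metadata from decaying after the service is created.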

2.8 Security and governance#

Security must be built into the path, not added at the end.

Tools:

IAM Access Analyzer
AWS Config
GuardDuty
Security Hub
KMS
Secrets Manager
Kyverno
OPA Gatekeeper
Trivy
Checkov
Cosign
Falco

Skills:

least privilege
workload identity
secret rotation
network segmentation
image scanning
policy as code
audit trails
secure CI/CD
incident response for security events

Projects to build:

Kubernetes policy pack that blocks privileged pods and missing resource limits.
CI pipeline with image scanning and SBOM output.
AWS IAM review workflow for excessive permissions.
Secret rotation demo with External Secrets Operator.

What "good" looks like:

Unsafe workloads are blocked by default.
Secrets are not copied manually.
Cloud access is short-lived and auditable.
Security checks happen before production.

2.9 Cost and efficiency#

Production systems must hold up financially too.

Tools:

AWS Cost Explorer
AWS Budgets
Compute Optimizer
Kubecost
Infracost
Karpenter
HPA / VPA

Skills:

right-sizing
autoscaling
cost allocation tags
idle resource detection
reserved capacity basics
spot capacity trade-offs
cost-aware architecture

Projects to build:

Kubernetes cost dashboard with Kubecost.
Terraform pull request cost estimate with Infracost.
Karpenter node consolidation demo.
AWS budget alerting and cost anomaly notification workflow.

What "good" looks like:

Teams know what services cost.
Waste is visible.
Autoscaling balances reliability and cost.
Cost controls do not surprise production workloads.
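
Right-sizing is mostly arithmetic: compare what a workload requests with what it actually uses, then leave headroom. The sketch below is a minimal example, assuming CPU usage samples in millicores have already been collected from a metrics system. The 95th percentile and the 20% headroom are illustrative defaults, not recommendations from any specific tool.

```python
def p95(samples: list[int]) -> int:
    """Nearest-rank 95th percentile of integer samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def recommend_cpu_request(usage_millicores: list[int], headroom_pct: int = 20) -> int:
    """Suggested CPU request: p95 usage plus headroom, rounded up."""
    return (p95(usage_millicores) * (100 + headroom_pct) + 99) // 100


# Hypothetical workload: requests 1000m but rarely uses more than 200m.
current_request = 1000
usage = list(range(10, 210, 10))  # 20 samples from 10m to 200m

suggested = recommend_cpu_request(usage)
if suggested < current_request:
    print(f"Over-provisioned: request {current_request}m, suggested {suggested}m")
```

The headroom is where the availability and cost trade-off lives. A latency-sensitive service may justify more headroom than a batch job, and the number should be chosen per workload class rather than globally.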

3. Technology Stack to Prioritize#

Tier 1: Must-have foundation#

Linux
Networking
Git
Docker
AWS
Terraform
Kubernetes
GitHub Actions
Bash
TypeScript / Node.js
SQL basics

Tier 2: Production platform stack#

EKS
Helm
Argo CD
Karpenter
External Secrets Operator
AWS Load Balancer Controller
Prometheus
Grafana
OpenTelemetry
Loki / Tempo
Kyverno or OPA Gatekeeper
Trivy
Checkov
Infracost

Tier 3: Differentiators#

Backstage
Crossplane
Kubernetes operators
Cosign and software supply chain security
SLO tooling
Incident management tooling
Chaos testing
AI-assisted operations and runbook generation
Platform APIs

4. Certifications to Target#

Do not collect certifications randomly. Use them as milestones for capability.

Cloud#

AWS Certified Solutions Architect - Associate#

Why:

Validates cloud architecture fundamentals.
Helps with networking, IAM, compute, storage, and resilience decisions.

AWS Certified DevOps Engineer - Professional#

Why:

Directly aligns with delivery automation, monitoring/logging, resilient cloud solutions, incident/event response, security, and compliance.
AWS says this exam validates expertise in provisioning, operating, and managing distributed systems and services on AWS.

Google Professional Cloud DevOps Engineer#

Why:

Strong SRE alignment.
Google describes the role as balancing reliability with delivery speed while optimizing production systems for performance and cost.

Kubernetes and cloud native#

CKA: Certified Kubernetes Administrator#

Why:

Proves hands-on Kubernetes operations ability.
Useful for cluster administration, troubleshooting, and workload operations.

CKS: Certified Kubernetes Security Specialist#

Why:

Best next step after CKA if you want to own production security.
Covers workload, cluster, supply chain, and runtime security.

PCA: Prometheus Certified Associate#

Why:

Helps prove monitoring and alerting fundamentals.

OTCA: OpenTelemetry Certified Associate#

Why:

Good fit for modern observability and telemetry pipelines.

CAPA: Certified Argo Project Associate#

Why:

Useful if you want GitOps to be a visible part of your platform story.

CNPA / CNPE: Cloud Native Platform Engineering#

Why:

CNPA validates platform engineering fundamentals like automation, security, observability, continuous delivery, platform APIs, and developer experience.
CNPE is performance-based and targets advanced platform engineering: GitOps/CD, self-service capabilities, observability/operations, security, and policy enforcement.

Infrastructure as Code#

Terraform Associate#

Why:

Validates Terraform fundamentals.

Terraform Authoring and Operations Professional#

Why:

Better long-term signal than Associate if you want to show advanced Terraform design and operations ability.

5. Resources to Study#

Reliability and operating production systems#

Google SRE books and resources
AWS Well-Architected Framework
AWS Operational Excellence Pillar
AWS Reliability Pillar
Google Cloud SRE guidance
DORA / State of DevOps research

What to extract:

SLOs
error budgets
alert quality
incident response
postmortem culture
reducing toil
release engineering
capacity planning

Cloud and architecture#

AWS Well-Architected Labs
AWS Architecture Center
AWS Builders' Library
Google Cloud Architecture Center
Azure Architecture Center, even if AWS is primary

What to extract:

trade-off thinking
blast radius reduction
multi-AZ design
backup and restore patterns
identity boundaries
operational readiness

Kubernetes and platform engineering#

Kubernetes official docs
CNCF landscape
CNCF platform engineering certifications and curricula
Argo CD docs
Helm docs
Karpenter docs
External Secrets Operator docs
Backstage docs

What to extract:

workload health
GitOps operation
cluster security
platform APIs
developer self-service
service catalogs

Observability#

OpenTelemetry docs
Prometheus docs
Grafana docs
Google SRE chapters on monitoring and alerting

What to extract:

metrics, logs, traces
telemetry pipelines
alert design
SLO dashboards
burn-rate alerts
correlation during incidents

Security#

AWS security best practices
Kubernetes security docs
OWASP Top 10
OWASP Kubernetes Top 10
SLSA supply chain security
Sigstore / Cosign docs

What to extract:

least privilege
secure CI/CD
secrets handling
image provenance
runtime detection
policy as code

Books#

Site Reliability Engineering by Google
The Site Reliability Workbook by Google
Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim
The Phoenix Project
The Unicorn Project
Designing Data-Intensive Applications
Release It!
Cloud Native DevOps with Kubernetes
Kubernetes Patterns
Infrastructure as Code by Kief Morris

6. Portfolio Projects to Build#

These projects should prove you can help teams, not just deploy tools.

Project 1: Production EKS Platform Baseline#

Goal:

Build a secure AWS EKS platform that a team could realistically use.

Include:

Terraform-managed VPC
EKS cluster
managed node groups or Karpenter
IAM roles for service accounts
External Secrets Operator
AWS Load Balancer Controller
cert-manager
Argo CD
Prometheus and Grafana
basic NetworkPolicy
example app deployment

Show:

architecture diagram
Terraform module structure
deployment flow
security decisions
failure modes
cost notes
runbook

Project 2: Secure CI/CD Delivery System#

Goal:

Build a pipeline that protects production.

Include:

GitHub Actions
Docker build
unit tests
linting
dependency scan
image scan
SBOM generation
image signing with Cosign
OIDC authentication to AWS
deploy to staging
smoke tests
manual approval for production
rollback workflow

Show:

why each gate exists
what happens on failure
how secrets are avoided
how artifacts are traced

Project 3: Observability and SLO Platform#

Goal:

Prove you can detect user-impacting problems before support tickets arrive.

Include:

OpenTelemetry instrumentation
metrics, logs, traces
Prometheus
Grafana
Loki
Tempo
SLO dashboard
burn-rate alerts
runbooks attached to alerts

Show:

good alert vs bad alert
dashboard screenshots
example incident walkthrough
trace showing a slow dependency

Project 4: Failure and Recovery Lab#

Goal:

Demonstrate recovery, not just deployment.

Include:

Kubernetes app
database
backup workflow
restore workflow
simulated failed deploy
simulated dependency latency
load test
incident timeline
postmortem

Show:

recovery time
what signal detected failure
what action recovered the system
what guardrail you added afterward

Project 5: Internal Developer Platform Starter#

Goal:

Show platform engineering and developer experience.

Include:

Backstage
service catalog
golden path template
app scaffold
default CI pipeline
default Helm chart
default observability dashboard
default runbook

Show:

how a developer creates a new service
what defaults are included
how the platform reduces cognitive load
how you measure platform success

Project 6: Policy-as-Code Guardrails#

Goal:

Show governance without blocking delivery unnecessarily.

Include:

Kyverno or OPA Gatekeeper
policies for privileged pods
policies for resource requests/limits
policies for required labels
policies for image registry rules
CI policy checks

Show:

unsafe deployment blocked
safe deployment allowed
exception process
policy documentation

Project 7: Cost-Aware Kubernetes Platform#

Goal:

Show that production reliability includes cost control.

Include:

Karpenter
Kubecost
Infracost
AWS Budgets
right-sizing recommendations
idle workload detection

Show:

before/after cost estimate
node consolidation
team-level cost visibility
trade-offs between availability and cost

7. Case Studies to Write#

Each case study should follow this structure:

Context
Problem
Constraints
Design options
Chosen architecture
Implementation
Failure modes
Operational guardrails
Results
Lessons learned

Case Study 1: Building a Production-Ready EKS Baseline#

Angle:

"Kubernetes is not maturity unless the platform is secure, observable, and recoverable."

Cover:

networking
IAM
GitOps
secrets
ingress
workload identity
observability
backup strategy

Case Study 2: From Green Pipeline to Safe Delivery#

Angle:

"A green pipeline only means the pipeline passed. It does not mean production is safe."

Cover:

scanning
signing
promotion
smoke tests
rollback
artifact traceability

Case Study 3: Reducing Alert Noise With Better Signals#

Angle:

"The goal is not more alerts. The goal is actionable signal."

Cover:

alert audit
SLOs
burn-rate alerts
runbooks
routing
severity levels

Case Study 4: Designing Terraform Modules That Do Not Surprise Production#

Angle:

"Terraform does exactly what you ask. The hard part is making sure you asked safely."

Cover:

module boundaries
plan review
drift detection
state safety
destructive change guardrails

Case Study 5: Recovering From a Bad Deployment#

Angle:

"Rollback is not a button. It is an operational design."

Cover:

deployment strategy
health checks
smoke tests
rollback criteria
incident timeline
postmortem

Case Study 6: Building a Developer Platform Golden Path#

Angle:

"The platform should make the secure, observable path the easiest path."

Cover:

service templates
CI defaults
observability defaults
documentation
ownership metadata
developer feedback

Case Study 7: Secrets Management Without Copy-Paste Risk#

Angle:

"Secret handling is an operations problem, not just a security checkbox."

Cover:

External Secrets Operator
AWS Secrets Manager
IAM roles
rotation
auditability

8. Blog Topics to Write#

These should position you as a practical platform reliability engineer.

Foundational posts#

Kubernetes is not maturity: what teams still get wrong after adopting it
Green pipelines are not safe deployments
Terraform did exactly what we asked: why plan review matters
What makes infrastructure "reviewable"?
Why production readiness starts before launch

Reliability posts#

SLOs explained without enterprise jargon
Alert fatigue: how to design alerts people trust
How to write a useful runbook
What a good postmortem actually includes
The difference between uptime and user trust

Platform engineering posts#

What is a golden path?
Internal developer platforms without buzzwords
Backstage: what problem does it actually solve?
Platform engineering is product work
Measuring platform success beyond deployment count

Cloud and Kubernetes posts#

EKS baseline: the pieces teams forget
IAM Roles for Service Accounts explained
Karpenter vs Cluster Autoscaler
External Secrets Operator in real workflows
Kubernetes probes: how bad health checks cause outages

CI/CD and supply chain posts#

OIDC in CI/CD: why long-lived AWS keys should disappear
Container image signing with Cosign
SBOMs explained for working engineers
Progressive delivery without overcomplication
Rollback strategy for small teams

AI and automation posts#

AI can write Terraform, but can it own the blast radius?
How I use AI to review infrastructure changes
The new DevOps skill: asking better operational questions
Automating toil without hiding risk
Why human judgment still matters in platform engineering

9. What to Measure#

Use metrics to learn, not to decorate dashboards.

Delivery metrics#

deployment frequency
lead time for changes
change failure rate
mean time to recovery
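
Three of these delivery metrics can be computed from a plain list of deployment records. The sketch below is a minimal example with a hypothetical record shape. It measures recovery from deploy time to recovery time, which is a simplification, and it leaves out lead time for changes because that also needs commit timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    deployed_at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None


def deployments_per_day(deployments: list[Deployment], period_days: int) -> float:
    return len(deployments) / period_days


def change_failure_rate(deployments: list[Deployment]) -> float:
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def mean_time_to_recovery(deployments: list[Deployment]) -> Optional[timedelta]:
    """Average recovery time across failed deployments that have recovered."""
    durations = [
        d.recovered_at - d.deployed_at
        for d in deployments
        if d.failed and d.recovered_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical week: four deployments, one failure recovered after 45 minutes.
start = datetime(2024, 1, 1, 9, 0)
history = [
    Deployment(start),
    Deployment(start + timedelta(days=1)),
    Deployment(
        start + timedelta(days=3),
        failed=True,
        recovered_at=start + timedelta(days=3, minutes=45),
    ),
    Deployment(start + timedelta(days=5)),
]

print(deployments_per_day(history, period_days=7))
print(change_failure_rate(history))
print(mean_time_to_recovery(history))
```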

Reliability metrics#

availability
latency percentiles
error rate
saturation
SLO compliance
error budget burn

Platform metrics#

time to create a new service
time to first deploy
percentage of services using golden path
reduction in manual platform tickets
developer satisfaction
onboarding time

Security metrics#

unresolved critical vulnerabilities
workloads missing resource limits
workloads using privileged permissions
secrets nearing expiration
CI jobs using long-lived credentials
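
"Secrets nearing expiration" is a simple metric to compute once expiry dates are recorded somewhere queryable. The sketch below assumes a mapping from secret name to expiry date. The names, dates, and 14-day warning window are hypothetical, and in practice the data would come from the secrets manager or certificate inventory.

```python
from datetime import date, timedelta


def secrets_nearing_expiration(
    expirations: dict[str, date],
    today: date,
    warning_days: int = 14,
) -> list[str]:
    """Names of secrets that expire within the warning window or already have."""
    cutoff = today + timedelta(days=warning_days)
    return sorted(name for name, expires_on in expirations.items() if expires_on <= cutoff)


# Hypothetical inventory used only for illustration.
inventory = {
    "payments/db-password": date(2024, 6, 10),
    "ci/registry-token": date(2024, 9, 1),
    "api/tls-cert": date(2024, 5, 28),
}

for name in secrets_nearing_expiration(inventory, today=date(2024, 6, 1)):
    print("Rotate soon:", name)
```

Passing `today` in explicitly keeps the function deterministic, which makes it easy to test and to replay against past dates during a postmortem.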

Cost metrics#

idle compute
namespace/team cost
over-provisioned workloads
monthly cost trend
cost per environment

10. How to Talk About Yourself#

Avoid:

"I am a DevOps engineer who knows many tools."
"I deploy Kubernetes applications."
"I write Terraform and CI/CD."

Use:

"I build cloud platforms that make production safer to change."
"I design delivery systems with security, observability, and rollback built in."
"I help teams ship confidently by turning operational knowledge into guardrails."
"I use automation and AI to reduce toil, but keep human judgment around blast radius, security, and recovery."
"I care about the gap between green dashboards and real user experience."

11. Portfolio Positioning Checklist#

Your portfolio should prove:

You understand systems beyond tools.
You can design cloud foundations.
You can secure delivery.
You can observe production meaningfully.
You can recover from failure.
You can write clear runbooks and postmortems.
You can reduce developer friction.
You can explain trade-offs.
You can use automation without losing ownership.

For every project, include:

architecture diagram
problem statement
constraints
tools used
why those tools
security decisions
failure modes
observability design
rollback/recovery plan
lessons learned

12. Recommended Source Links#

Official certification and cloud-native sources#

AWS Certified DevOps Engineer - Professional: https://docs.aws.amazon.com/aws-certification/latest/userguide/devops-engineer-professional-02.html
CNCF Cloud Native Certifications: https://www.cncf.io/training/certification/
CNCF Certified Cloud Native Platform Engineering Associate: https://www.cncf.io/training/certification/cnpa/
CNCF Certified Cloud Native Platform Engineer: https://www.cncf.io/training/certification/cnpe/
HashiCorp Certifications: https://www.hashicorp.com/certification
Google Professional Cloud DevOps Engineer: https://cloud.google.com/learn/certification/cloud-devops-engineer

Reliability, observability, and operations#

AWS Well-Architected Framework: https://aws.amazon.com/architecture/well-architected/
AWS Operational Excellence Pillar: https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html
Google SRE resources: https://sre.google/resources/
Google Cloud SRE overview: https://cloud.google.com/sre
OpenTelemetry documentation: https://opentelemetry.io/docs/
DORA metrics overview: https://www.atlassian.com/devops/frameworks/dora-metrics