The Big Picture Direction: Cloud Platforms That Hold Up in Production

This document sets a big-picture direction for becoming the engineer described by this value proposition:

I build secure delivery systems, reliable infrastructure, and operational guardrails for teams that need more than YAML, dashboards, and green pipelines. The goal is simple: ship confidently, recover quickly, and own failure before users feel it.

The goal is not to become a tool collector. The goal is to become the person teams trust when production matters.


North Star

Become the engineer who can answer these questions clearly:

  • Can we deploy this safely?
  • If it fails, how will we know?
  • If users are affected, what signal fires first?
  • Can we roll back?
  • Can we recover data?
  • Who owns this service?
  • What is the blast radius?
  • What guardrail prevents this mistake next time?
  • What should be automated?
  • What should still require human review?

That is how you become the person teams trust when the platform has to hold up in production.


1. The Mindset

1.1 Think in outcomes, not tools

Tools are only useful when they improve an operational outcome.

Do not say:

  • "I know Kubernetes."
  • "I know Terraform."
  • "I know GitHub Actions."

Say:

  • "I can make deployments safer."
  • "I can make infrastructure reviewable and recoverable."
  • "I can reduce mean time to recovery."
  • "I can design guardrails that stop common failure modes."
  • "I can help teams understand whether production is healthy."

1.2 Own failure end-to-end

Production reliability is not only about preventing failure. It is about preparing for failure.

You should constantly ask:

  • What happens if this dependency slows down?
  • What happens if this deployment partially succeeds?
  • What happens if this Terraform change replaces a shared resource?
  • What happens if a secret expires?
  • What happens if a region or availability zone fails?
  • What signal tells us users are suffering before support tickets arrive?

1.3 Automate, but keep judgment in the loop

AI and automation can write YAML, Terraform, scripts, and runbooks faster than humans.

Your value is knowing:

  • what should be automated
  • what should require review
  • what blast radius is acceptable
  • what signals prove the change worked
  • what rollback path exists
  • what security boundary must not be crossed

1.4 Prefer boring systems

The best platform work often feels boring:

  • predictable deployments
  • boring rollbacks
  • boring alerts
  • boring infrastructure reviews
  • boring incident response
  • boring onboarding

Boring means the system is understandable under pressure.


2. Core Capability Map

2.1 Cloud foundations

You need enough cloud architecture depth to build environments that teams can safely use.

Focus areas:

  • AWS account structure and organization strategy
  • IAM roles, permission boundaries, and least privilege
  • VPC design, subnets, routing, NAT, security groups, NACLs
  • Load balancing and DNS
  • Secrets management
  • Compute choices: EC2, ECS, EKS, Lambda
  • Storage choices: S3, EBS, EFS
  • Database operations: RDS, backups, failover, read replicas
  • Multi-environment design: dev, staging, production

Tools and services to know:

  • AWS Organizations
  • IAM
  • VPC
  • Route 53
  • ACM
  • ALB / NLB
  • EKS
  • ECS
  • RDS
  • S3
  • Secrets Manager
  • Systems Manager Parameter Store
  • CloudWatch
  • CloudTrail
  • AWS Config
  • AWS Backup

What "good" looks like:

  • Environments are consistent.
  • IAM is scoped and reviewable.
  • Networking is understandable.
  • Production has backups, monitoring, and recovery paths.
  • Changes are made through code, not console-clicking.

2.2 Infrastructure as Code

IaC is not valuable because it creates resources. It is valuable because it makes infrastructure reviewable, repeatable, and recoverable.

Tools:

  • Terraform
  • OpenTofu
  • Terragrunt, only when complexity justifies it
  • AWS CloudFormation, enough to understand AWS-native patterns
  • Checkov / tfsec / Trivy for IaC scanning
  • Infracost for cost visibility

Skills:

  • module design
  • remote state
  • state locking
  • environment composition
  • drift detection
  • plan review
  • policy checks
  • import and refactor strategy
  • safe destroy prevention

Projects to build:

  • Multi-account AWS baseline with Terraform.
  • VPC + EKS + RDS platform module.
  • Terraform plan review pipeline with security and cost checks.
  • Drift detection workflow that opens an issue when drift appears.
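The drift detection workflow can start from a very small decision function. The sketch below relies on the documented behavior of `terraform plan -detailed-exitcode` (0 for an empty plan, 1 for an error, 2 for a plan with changes). The names `DriftStatus`, `classify_plan_exit_code`, and `should_open_issue` are hypothetical, and the sketch assumes the scheduled job runs against the commit that was last applied, so a non-empty plan really does mean drift.

```python
from enum import Enum


class DriftStatus(Enum):
    IN_SYNC = "in_sync"
    DRIFT = "drift"
    ERROR = "error"


def classify_plan_exit_code(exit_code: int) -> DriftStatus:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status.

    With -detailed-exitcode, Terraform exits with 0 when the plan is empty,
    1 when the command failed, and 2 when the plan contains changes.
    """
    if exit_code == 0:
        return DriftStatus.IN_SYNC
    if exit_code == 2:
        return DriftStatus.DRIFT
    return DriftStatus.ERROR


def should_open_issue(exit_code: int) -> bool:
    """Open an issue only for real drift; errors should fail the job instead."""
    return classify_plan_exit_code(exit_code) is DriftStatus.DRIFT
```

Treating errors separately from drift keeps a broken credential or backend outage from being filed as "drift" and ignored.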

What "good" looks like:

  • Terraform modules are small and understandable.
  • Plans are reviewed before apply.
  • Destructive changes are guarded.
  • State is remote, locked, and backed up.
  • Environments share patterns without copy-paste chaos.
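One way to guard destructive changes is to inspect the machine-readable plan before apply. The sketch below reads the `resource_changes` list that `terraform show -json` produces for a saved plan and flags any resource whose actions include `delete` (a replacement appears as a delete paired with a create). The function name and the sample plan are illustrative assumptions, not output from a real project.

```python
import json


def find_destructive_changes(plan: dict) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # Replacements are reported as ["delete", "create"] or
        # ["create", "delete"], so checking for "delete" covers both.
        if "delete" in actions:
            flagged.append(change["address"])
    return flagged


# Hypothetical, trimmed-down plan document for illustration only.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
    {"address": "aws_security_group.old", "change": {"actions": ["delete"]}},
    {"address": "aws_iam_role.ci", "change": {"actions": ["no-op"]}}
  ]
}
""")

if __name__ == "__main__":
    for address in find_destructive_changes(SAMPLE_PLAN):
        print(f"needs explicit approval: {address}")
```

In a pipeline, a non-empty result could block the apply step until a reviewer explicitly approves the listed resources.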

2.3 Kubernetes and cloud-native operations

Kubernetes is not maturity by itself. It is only useful if workloads are easier to deploy, observe, scale, and recover.

Tools:

  • Kubernetes
  • Helm
  • Kustomize
  • Argo CD
  • ExternalDNS
  • cert-manager
  • AWS Load Balancer Controller
  • Karpenter
  • Metrics Server
  • Prometheus
  • Grafana
  • OpenTelemetry
  • Kyverno or OPA Gatekeeper
  • Sealed Secrets or External Secrets Operator
  • Velero

Skills:

  • pod lifecycle
  • deployments and rollouts
  • services and ingress
  • resource requests and limits
  • probes
  • autoscaling
  • node provisioning
  • workload identity
  • network policy
  • policy enforcement
  • secret management
  • backup and disaster recovery
  • GitOps operations

Projects to build:

  • EKS cluster with Terraform and Karpenter.
  • GitOps deployment workflow with Argo CD.
  • Kubernetes workload baseline chart with probes, resources, PDBs, HPA, and NetworkPolicy.
  • External Secrets integration with AWS Secrets Manager.
  • Velero backup and restore demo.

What "good" looks like:

  • Workloads declare health clearly.
  • Deployments can roll forward and roll back.
  • Secrets are not stored in plain Git.
  • Policies prevent unsafe defaults.
  • Teams can deploy without needing cluster-admin access.
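"Workloads declare health clearly" can be checked mechanically. The following is a minimal sketch that walks a Deployment manifest, already parsed into a dictionary, and reports containers without a `readinessProbe` or `livenessProbe`. The function name and sample manifest are hypothetical, and whether every workload needs a liveness probe is a judgment call this sketch does not make for you.

```python
def missing_probes(deployment: dict) -> list[str]:
    """Return '<container>: <probe>' entries for probes a Deployment lacks."""
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    problems = []
    for container in containers:
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                problems.append(f"{container.get('name', '<unnamed>')}: {probe}")
    return problems


# Hypothetical manifest: "api" declares both probes, "sidecar" declares none.
SAMPLE_DEPLOYMENT = {
    "kind": "Deployment",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "api",
                        "readinessProbe": {"httpGet": {"path": "/ready", "port": 8080}},
                        "livenessProbe": {"httpGet": {"path": "/live", "port": 8080}},
                    },
                    {"name": "sidecar"},
                ]
            }
        }
    },
}
```

A check like this fits naturally into a chart's CI so that the baseline is enforced before a manifest ever reaches the cluster.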

2.4 Secure delivery systems

The pipeline should not just deploy. It should protect the organization from bad changes.

Tools:

  • GitHub Actions
  • GitLab CI
  • Jenkins, enough to support legacy teams
  • Argo CD
  • Cosign
  • Syft
  • Grype
  • Trivy
  • SLSA concepts
  • OIDC federation
  • Dependabot / Renovate
  • SonarQube

Skills:

  • pipeline design
  • artifact versioning
  • environment promotion
  • OIDC-based cloud authentication
  • secret-free CI
  • container scanning
  • dependency scanning
  • SBOM generation
  • signed images
  • deployment approvals
  • rollback automation

Projects to build:

  • GitHub Actions pipeline that builds, scans, signs, and deploys a container.
  • OIDC authentication from GitHub Actions to AWS.
  • Progressive promotion from dev to staging to production.
  • Automated rollback on failed smoke test.

What "good" looks like:

  • CI does not use long-lived cloud keys.
  • Every artifact is traceable to a commit.
  • Vulnerability checks happen before deploy.
  • Failed releases stop early.
  • Rollback is documented and tested.
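To make "CI does not use long-lived cloud keys" concrete, here is a sketch of the IAM trust policy that lets a GitHub Actions workflow assume an AWS role through OIDC. The account ID, repository, and branch are placeholder assumptions; the sketch also assumes the GitHub OIDC identity provider has already been created in the account.

```python
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Build a trust policy that only one repository branch can use.

    `repo` is "<owner>/<name>". The `sub` condition limits which workflow
    runs may assume the role; leaving it out would trust every repository
    that can obtain a token from the provider.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    import json

    # Placeholder values for illustration only.
    print(json.dumps(github_oidc_trust_policy("123456789012", "example-org/example-app"), indent=2))
```

The resulting credentials are short-lived and scoped to a single role, which is what makes the access auditable.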

2.5 Observability and production signal

Dashboards are not enough. You need signals that explain user impact and system behavior.

Tools:

  • OpenTelemetry
  • Prometheus
  • Grafana
  • Loki
  • Tempo
  • Alertmanager
  • CloudWatch
  • Sentry
  • Datadog or New Relic, enough to understand commercial observability platforms

Signals to master:

  • latency
  • traffic
  • errors
  • saturation
  • availability
  • request traces
  • dependency latency
  • queue depth
  • database connection pressure
  • retry storms
  • cost anomalies

Skills:

  • RED metrics
  • USE metrics
  • SLOs and error budgets
  • alert routing
  • alert severity design
  • dashboard design
  • distributed tracing
  • log correlation
  • practical runbooks

Projects to build:

  • OpenTelemetry-instrumented Node.js service.
  • Prometheus + Grafana dashboard for API latency, errors, and saturation.
  • SLO dashboard with burn-rate alerts.
  • Trace-based debugging demo across multiple services.

What "good" looks like:

  • Alerts map to user impact or real operational risk.
  • Dashboards answer specific questions.
  • Logs, metrics, and traces connect to the same incident.
  • On-call engineers know what action to take.
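Burn-rate alerting is easier to reason about with the arithmetic written out. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below uses a multi-window check in the style of the Google SRE Workbook; the 14.4 threshold is that book's example for a 1-hour window against a 30-day SLO and should be treated as an assumption to tune, not a rule.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent, relative to plan."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Page only when both windows burn fast.

    The long window (for example 1 hour) shows the problem is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

For a 99.9% SLO, a sustained 2% error ratio burns budget roughly 20 times faster than planned, which clears the threshold; if the short window has already recovered, the alert stays quiet.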

2.6 Reliability engineering and incident response

This is where you become more than a deployment engineer.

Skills:

  • incident command
  • triage
  • rollback decisions
  • postmortems
  • failure mode analysis
  • dependency mapping
  • load testing
  • chaos testing
  • capacity planning
  • disaster recovery

Tools:

  • Grafana OnCall / PagerDuty / Opsgenie
  • Incident.io
  • Statuspage
  • k6
  • Locust
  • LitmusChaos
  • AWS Fault Injection Service

Projects to build:

  • Incident response playbook for a Kubernetes API outage.
  • Postmortem for a failed deployment.
  • Load test that reveals saturation limits.
  • DR drill restoring database backup into a fresh environment.

What "good" looks like:

  • Incidents have roles.
  • Rollback criteria are known before deployment.
  • Postmortems focus on learning, not blame.
  • Recovery steps are tested, not assumed.
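"Rollback criteria are known before deployment" can mean writing them down as data rather than deciding during the incident. This is a minimal sketch with hypothetical names and thresholds; the health values would come from whatever metrics and smoke tests the release actually has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackCriteria:
    """Thresholds agreed on before the release starts."""
    max_error_rate: float
    max_p99_latency_ms: float


@dataclass(frozen=True)
class ReleaseHealth:
    """Signals observed after the release."""
    error_rate: float
    p99_latency_ms: float
    smoke_tests_passed: bool


def rollback_reasons(health: ReleaseHealth, criteria: RollbackCriteria) -> list[str]:
    """Return every criterion the release violates; empty means keep going."""
    reasons = []
    if not health.smoke_tests_passed:
        reasons.append("smoke tests failed")
    if health.error_rate > criteria.max_error_rate:
        reasons.append("error rate above threshold")
    if health.p99_latency_ms > criteria.max_p99_latency_ms:
        reasons.append("p99 latency above threshold")
    return reasons
```

Returning the reasons, not just a boolean, gives the incident timeline and the postmortem something concrete to record.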

2.7 Platform engineering and developer experience

Platform engineering is about making the right path the easy path.

Tools:

  • Backstage
  • Score
  • Humanitec concepts
  • Crossplane
  • Kubernetes operators
  • Argo CD
  • Terraform modules
  • Internal templates

Skills:

  • golden paths
  • self-service environments
  • service catalog design
  • platform APIs
  • developer onboarding
  • paved-road templates
  • platform success metrics

Projects to build:

  • Backstage developer portal with service catalog.
  • Self-service app template that creates repo, pipeline, Helm chart, and observability defaults.
  • Platform API for provisioning a database or namespace.
  • Developer onboarding workflow measured from "new repo" to "running in staging."

What "good" looks like:

  • Developers can ship without learning every infrastructure detail.
  • Platform defaults include security, observability, and rollback behavior.
  • The platform reduces cognitive load.
  • Teams are not blocked by ticket queues for routine work.

2.8 Security and governance

Security must be built into the path, not added at the end.

Tools:

  • IAM Access Analyzer
  • AWS Config
  • GuardDuty
  • Security Hub
  • KMS
  • Secrets Manager
  • Kyverno
  • OPA Gatekeeper
  • Trivy
  • Checkov
  • Cosign
  • Falco

Skills:

  • least privilege
  • workload identity
  • secret rotation
  • network segmentation
  • image scanning
  • policy as code
  • audit trails
  • secure CI/CD
  • incident response for security events

Projects to build:

  • Kubernetes policy pack that blocks privileged pods and missing resource limits.
  • CI pipeline with image scanning and SBOM output.
  • AWS IAM review workflow for excessive permissions.
  • Secret rotation demo with External Secrets Operator.

What "good" looks like:

  • Unsafe workloads are blocked by default.
  • Secrets are not copied manually.
  • Cloud access is short-lived and auditable.
  • Security checks happen before production.
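The logic behind a policy that blocks privileged pods and missing resource limits is small enough to sketch. In a cluster this would be expressed as a Kyverno or OPA Gatekeeper policy; the Python below only illustrates the decision, is not either tool's syntax, and ignores init containers for brevity. Requiring a CPU limit is itself a debated choice, so treat that rule as an assumption.

```python
def policy_violations(pod_spec: dict) -> list[str]:
    """Return human-readable violations for a pod spec; empty means allowed."""
    violations = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            violations.append(f"{name}: privileged containers are not allowed")

        limits = (container.get("resources") or {}).get("limits") or {}
        for resource in ("cpu", "memory"):
            if resource not in limits:
                violations.append(f"{name}: missing {resource} limit")
    return violations


# Hypothetical pod specs for illustration.
UNSAFE_POD = {"containers": [{"name": "app", "securityContext": {"privileged": True}}]}
SAFE_POD = {
    "containers": [
        {
            "name": "app",
            "securityContext": {"privileged": False},
            "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
        }
    ]
}
```

Running the same check in CI and at admission time means developers see the violation in a pull request instead of at deploy time.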

2.9 Cost and efficiency

Production systems must hold up financially too.

Tools:

  • AWS Cost Explorer
  • AWS Budgets
  • Compute Optimizer
  • Kubecost
  • Infracost
  • Karpenter
  • HPA / VPA

Skills:

  • right-sizing
  • autoscaling
  • cost allocation tags
  • idle resource detection
  • reserved capacity basics
  • spot capacity trade-offs
  • cost-aware architecture

Projects to build:

  • Kubernetes cost dashboard with Kubecost.
  • Terraform pull request cost estimate with Infracost.
  • Karpenter node consolidation demo.
  • AWS budget alerting and cost anomaly notification workflow.

What "good" looks like:

  • Teams know what services cost.
  • Waste is visible.
  • Autoscaling balances reliability and cost.
  • Cost controls do not surprise production workloads.
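Right-sizing usually starts by comparing what a workload requests with what it actually uses. The sketch below flags workloads whose 95th-percentile CPU usage is a small fraction of the request. The field names, the 30% threshold, and the sample numbers are all assumptions; real data would come from a tool such as Kubecost or Prometheus.

```python
def overprovisioned(workloads: list[dict], min_utilization: float = 0.3) -> list[str]:
    """Names of workloads using less than min_utilization of their CPU request."""
    flagged = []
    for workload in workloads:
        requested = workload["cpu_request_millicores"]
        if requested <= 0:
            # No request set: nothing to right-size, and a different problem.
            continue
        utilization = workload["cpu_usage_p95_millicores"] / requested
        if utilization < min_utilization:
            flagged.append(workload["name"])
    return flagged


# Hypothetical usage data for illustration.
SAMPLE_WORKLOADS = [
    {"name": "api", "cpu_request_millicores": 1000, "cpu_usage_p95_millicores": 150},
    {"name": "worker", "cpu_request_millicores": 500, "cpu_usage_p95_millicores": 400},
    {"name": "batch", "cpu_request_millicores": 0, "cpu_usage_p95_millicores": 50},
]
```

Using a high percentile instead of the average leaves headroom for spikes, which is one way to keep cost controls from surprising production workloads.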

3. Technology Stack to Prioritize

Tier 1: Must-have foundation

  • Linux
  • Networking
  • Git
  • Docker
  • AWS
  • Terraform
  • Kubernetes
  • GitHub Actions
  • Bash
  • TypeScript / Node.js
  • SQL basics

Tier 2: Production platform stack

  • EKS
  • Helm
  • Argo CD
  • Karpenter
  • External Secrets Operator
  • AWS Load Balancer Controller
  • Prometheus
  • Grafana
  • OpenTelemetry
  • Loki / Tempo
  • Kyverno or OPA Gatekeeper
  • Trivy
  • Checkov
  • Infracost

Tier 3: Differentiators

  • Backstage
  • Crossplane
  • Kubernetes operators
  • Cosign and software supply chain security
  • SLO tooling
  • Incident management tooling
  • Chaos testing
  • AI-assisted operations and runbook generation
  • Platform APIs

4. Certifications to Target

Do not collect certifications randomly. Use them as milestones for capability.

Cloud

AWS Certified Solutions Architect - Associate

Why:

  • Validates cloud architecture fundamentals.
  • Helps with networking, IAM, compute, storage, and resilience decisions.

AWS Certified DevOps Engineer - Professional

Why:

  • Directly aligns with delivery automation, monitoring/logging, resilient cloud solutions, incident/event response, security, and compliance.
  • AWS says this exam validates expertise in provisioning, operating, and managing distributed systems and services on AWS.

Google Professional Cloud DevOps Engineer

Why:

  • Strong SRE alignment.
  • Google describes the role as balancing reliability with delivery speed while optimizing production systems for performance and cost.

Kubernetes and cloud native

CKA: Certified Kubernetes Administrator

Why:

  • Proves hands-on Kubernetes operations ability.
  • Useful for cluster administration, troubleshooting, and workload operations.

CKS: Certified Kubernetes Security Specialist

Why:

  • Best next step after CKA if you want to own production security.
  • Covers workload, cluster, supply chain, and runtime security.

PCA: Prometheus Certified Associate

Why:

  • Helps prove monitoring and alerting fundamentals.

OTCA: OpenTelemetry Certified Associate

Why:

  • Good fit for modern observability and telemetry pipelines.

CAPA: Certified Argo Project Associate

Why:

  • Useful if you want GitOps to be a visible part of your platform story.

CNPA / CNPE: Cloud Native Platform Engineering

Why:

  • CNPA validates platform engineering fundamentals like automation, security, observability, continuous delivery, platform APIs, and developer experience.
  • CNPE is performance-based and targets advanced platform engineering: GitOps/CD, self-service capabilities, observability/operations, security, and policy enforcement.

Infrastructure as Code

Terraform Associate

Why:

  • Validates Terraform fundamentals.

Terraform Authoring and Operations Professional

Why:

  • A stronger long-term signal than the Associate if you want to show advanced Terraform design and operations ability.

5. Resources to Study

Reliability and operating production systems

  • Google SRE books and resources
  • AWS Well-Architected Framework
  • AWS Operational Excellence Pillar
  • AWS Reliability Pillar
  • Google Cloud SRE guidance
  • DORA / State of DevOps research

What to extract:

  • SLOs
  • error budgets
  • alert quality
  • incident response
  • postmortem culture
  • reducing toil
  • release engineering
  • capacity planning

Cloud and architecture

  • AWS Well-Architected Labs
  • AWS Architecture Center
  • AWS Builders' Library
  • Google Cloud Architecture Center
  • Azure Architecture Center, even if AWS is primary

What to extract:

  • trade-off thinking
  • blast radius reduction
  • multi-AZ design
  • backup and restore patterns
  • identity boundaries
  • operational readiness

Kubernetes and platform engineering

  • Kubernetes official docs
  • CNCF landscape
  • CNCF platform engineering certifications and curricula
  • Argo CD docs
  • Helm docs
  • Karpenter docs
  • External Secrets Operator docs
  • Backstage docs

What to extract:

  • workload health
  • GitOps operation
  • cluster security
  • platform APIs
  • developer self-service
  • service catalogs

Observability

  • OpenTelemetry docs
  • Prometheus docs
  • Grafana docs
  • Google SRE chapters on monitoring and alerting

What to extract:

  • metrics, logs, traces
  • telemetry pipelines
  • alert design
  • SLO dashboards
  • burn-rate alerts
  • correlation during incidents

Security

  • AWS security best practices
  • Kubernetes security docs
  • OWASP Top 10
  • OWASP Kubernetes Top 10
  • SLSA supply chain security
  • Sigstore / Cosign docs

What to extract:

  • least privilege
  • secure CI/CD
  • secrets handling
  • image provenance
  • runtime detection
  • policy as code

Books

  • Site Reliability Engineering by Google
  • The Site Reliability Workbook by Google
  • Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim
  • The Phoenix Project
  • The Unicorn Project
  • Designing Data-Intensive Applications
  • Release It!
  • Cloud Native DevOps with Kubernetes
  • Kubernetes Patterns
  • Infrastructure as Code by Kief Morris

6. Portfolio Projects to Build

These projects should prove you can help teams, not just deploy tools.

Project 1: Production EKS Platform Baseline

Goal:

Build a secure AWS EKS platform that a team could realistically use.

Include:

  • Terraform-managed VPC
  • EKS cluster
  • managed node groups or Karpenter
  • IAM roles for service accounts
  • External Secrets Operator
  • AWS Load Balancer Controller
  • cert-manager
  • Argo CD
  • Prometheus and Grafana
  • basic NetworkPolicy
  • example app deployment

Show:

  • architecture diagram
  • Terraform module structure
  • deployment flow
  • security decisions
  • failure modes
  • cost notes
  • runbook

Project 2: Secure CI/CD Delivery System

Goal:

Build a pipeline that protects production.

Include:

  • GitHub Actions
  • Docker build
  • unit tests
  • linting
  • dependency scan
  • image scan
  • SBOM generation
  • image signing with Cosign
  • OIDC authentication to AWS
  • deploy to staging
  • smoke tests
  • manual approval for production
  • rollback workflow

Show:

  • why each gate exists
  • what happens on failure
  • how secrets are avoided
  • how artifacts are traced

Project 3: Observability and SLO Platform

Goal:

Prove you can detect user-impacting problems before support tickets arrive.

Include:

  • OpenTelemetry instrumentation
  • metrics, logs, traces
  • Prometheus
  • Grafana
  • Loki
  • Tempo
  • SLO dashboard
  • burn-rate alerts
  • runbooks attached to alerts

Show:

  • good alert vs bad alert
  • dashboard screenshots
  • example incident walkthrough
  • trace showing a slow dependency

Project 4: Failure and Recovery Lab

Goal:

Demonstrate recovery, not just deployment.

Include:

  • Kubernetes app
  • database
  • backup workflow
  • restore workflow
  • simulated failed deploy
  • simulated dependency latency
  • load test
  • incident timeline
  • postmortem

Show:

  • recovery time
  • what signal detected failure
  • what action recovered the system
  • what guardrail you added afterward

Project 5: Internal Developer Platform Starter

Goal:

Show platform engineering and developer experience.

Include:

  • Backstage
  • service catalog
  • golden path template
  • app scaffold
  • default CI pipeline
  • default Helm chart
  • default observability dashboard
  • default runbook

Show:

  • how a developer creates a new service
  • what defaults are included
  • how the platform reduces cognitive load
  • how you measure platform success

Project 6: Policy-as-Code Guardrails

Goal:

Show governance without blocking delivery unnecessarily.

Include:

  • Kyverno or OPA Gatekeeper
  • policies for privileged pods
  • policies for resource requests/limits
  • policies for required labels
  • policies for image registry rules
  • CI policy checks

Show:

  • unsafe deployment blocked
  • safe deployment allowed
  • exception process
  • policy documentation

Project 7: Cost-Aware Kubernetes Platform

Goal:

Show that production reliability includes cost control.

Include:

  • Karpenter
  • Kubecost
  • Infracost
  • AWS Budgets
  • right-sizing recommendations
  • idle workload detection

Show:

  • before/after cost estimate
  • node consolidation
  • team-level cost visibility
  • trade-offs between availability and cost

7. Case Studies to Write

Each case study should follow this structure:

  1. Context
  2. Problem
  3. Constraints
  4. Design options
  5. Chosen architecture
  6. Implementation
  7. Failure modes
  8. Operational guardrails
  9. Results
  10. Lessons learned

Case Study 1: Building a Production-Ready EKS Baseline

Angle:

"Kubernetes is not maturity unless the platform is secure, observable, and recoverable."

Cover:

  • networking
  • IAM
  • GitOps
  • secrets
  • ingress
  • workload identity
  • observability
  • backup strategy

Case Study 2: From Green Pipeline to Safe Delivery

Angle:

"A green pipeline only means the pipeline passed. It does not mean production is safe."

Cover:

  • scanning
  • signing
  • promotion
  • smoke tests
  • rollback
  • artifact traceability

Case Study 3: Reducing Alert Noise With Better Signals

Angle:

"The goal is not more alerts. The goal is actionable signal."

Cover:

  • alert audit
  • SLOs
  • burn-rate alerts
  • runbooks
  • routing
  • severity levels

Case Study 4: Designing Terraform Modules That Do Not Surprise Production

Angle:

"Terraform does exactly what you ask. The hard part is making sure you asked safely."

Cover:

  • module boundaries
  • plan review
  • drift detection
  • state safety
  • destructive change guardrails

Case Study 5: Recovering From a Bad Deployment

Angle:

"Rollback is not a button. It is an operational design."

Cover:

  • deployment strategy
  • health checks
  • smoke tests
  • rollback criteria
  • incident timeline
  • postmortem

Case Study 6: Building a Developer Platform Golden Path

Angle:

"The platform should make the secure, observable path the easiest path."

Cover:

  • service templates
  • CI defaults
  • observability defaults
  • documentation
  • ownership metadata
  • developer feedback

Case Study 7: Secrets Management Without Copy-Paste Risk

Angle:

"Secret handling is an operations problem, not just a security checkbox."

Cover:

  • External Secrets Operator
  • AWS Secrets Manager
  • IAM roles
  • rotation
  • auditability

8. Blog Topics to Write

These should position you as a practical platform reliability engineer.

Foundational posts

  • Kubernetes is not maturity: what teams still get wrong after adopting it
  • Green pipelines are not safe deployments
  • Terraform did exactly what we asked: why plan review matters
  • What makes infrastructure "reviewable"?
  • Why production readiness starts before launch

Reliability posts

  • SLOs explained without enterprise jargon
  • Alert fatigue: how to design alerts people trust
  • How to write a useful runbook
  • What a good postmortem actually includes
  • The difference between uptime and user trust

Platform engineering posts

  • What is a golden path?
  • Internal developer platforms without buzzwords
  • Backstage: what problem does it actually solve?
  • Platform engineering is product work
  • Measuring platform success beyond deployment count

Cloud and Kubernetes posts

  • EKS baseline: the pieces teams forget
  • IAM Roles for Service Accounts explained
  • Karpenter vs Cluster Autoscaler
  • External Secrets Operator in real workflows
  • Kubernetes probes: how bad health checks cause outages

CI/CD and supply chain posts

  • OIDC in CI/CD: why long-lived AWS keys should disappear
  • Container image signing with Cosign
  • SBOMs explained for working engineers
  • Progressive delivery without overcomplication
  • Rollback strategy for small teams

AI and automation posts

  • AI can write Terraform, but can it own the blast radius?
  • How I use AI to review infrastructure changes
  • The new DevOps skill: asking better operational questions
  • Automating toil without hiding risk
  • Why human judgment still matters in platform engineering

9. What to Measure

Use metrics to learn, not to decorate dashboards.

Delivery metrics

  • deployment frequency
  • lead time for changes
  • change failure rate
  • mean time to recovery
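Two of these metrics can be computed from a plain list of deployment records. The sketch below is deliberately simplified: the `Deployment` record is hypothetical, and recovery time is measured from the deploy timestamp instead of from detection, which a real implementation might prefer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    deployed_at: datetime
    failed: bool
    recovered_at: Optional[datetime] = None


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that caused a failure in production."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def mean_time_to_recovery(deployments: list[Deployment]) -> Optional[timedelta]:
    """Average time from a failed deploy to recovery, or None without data."""
    durations = [
        d.recovered_at - d.deployed_at
        for d in deployments
        if d.failed and d.recovered_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Returning `None` when there is nothing to average avoids reporting a misleading zero for a team that simply has no recorded recoveries yet.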

Reliability metrics

  • availability
  • latency percentiles
  • error rate
  • saturation
  • SLO compliance
  • error budget burn

Platform metrics

  • time to create a new service
  • time to first deploy
  • percentage of services using golden path
  • reduction in manual platform tickets
  • developer satisfaction
  • onboarding time

Security metrics

  • unresolved critical vulnerabilities
  • workloads missing resource limits
  • workloads using privileged permissions
  • secrets nearing expiration
  • CI jobs using long-lived credentials

Cost metrics

  • idle compute
  • namespace/team cost
  • over-provisioned workloads
  • monthly cost trend
  • cost per environment

10. How to Talk About Yourself

Avoid:

  • "I am a DevOps engineer who knows many tools."
  • "I deploy Kubernetes applications."
  • "I write Terraform and CI/CD."

Use:

  • "I build cloud platforms that make production safer to change."
  • "I design delivery systems with security, observability, and rollback built in."
  • "I help teams ship confidently by turning operational knowledge into guardrails."
  • "I use automation and AI to reduce toil, but keep human judgment around blast radius, security, and recovery."
  • "I care about the gap between green dashboards and real user experience."

11. Portfolio Positioning Checklist

Your portfolio should prove:

  • You understand systems beyond tools.
  • You can design cloud foundations.
  • You can secure delivery.
  • You can observe production meaningfully.
  • You can recover from failure.
  • You can write clear runbooks and postmortems.
  • You can reduce developer friction.
  • You can explain trade-offs.
  • You can use automation without losing ownership.

For every project, include:

  • architecture diagram
  • problem statement
  • constraints
  • tools used
  • why those tools
  • security decisions
  • failure modes
  • observability design
  • rollback/recovery plan
  • lessons learned

Official certification and cloud-native sources

Reliability, observability, and operations