Cloud & DevOps

Kubernetes in Production: Complete DevOps Guide to Deployment & Scaling (2025)

Complete guide to production-grade Kubernetes deployments. Learn cluster architecture, CI/CD automation, monitoring, security hardening, and cost optimization strategies.

TEELI Team
DevOps & Cloud Engineering Specialists
Jan 15, 2025
13 min read

Kubernetes in Production: The Complete DevOps Playbook



This guide covers everything needed to run Kubernetes reliably at scale—from cluster design to disaster recovery.



Why Kubernetes Won the Container War


The Pre-Kubernetes Era


2013-2015: Container Chaos
  • Manual container deployments with Docker CLI
  • Shell scripts for orchestration (fragile, non-portable)
  • No standardized networking or service discovery
  • Difficult to scale across multiple hosts

  • Early Orchestrators:
  • Docker Swarm: Simple but limited scalability
  • Apache Mesos: Powerful but complex to operate
  • AWS ECS: Proprietary, vendor lock-in

Kubernetes Advantages


    Real-World Impact:
  • Spotify: Reduced deployment time from hours to 15 minutes
  • Pinterest: 80% reduction in infrastructure costs
  • Airbnb: Scaled from 100 to 1,000 microservices seamlessly

    Figure: Kubernetes architecture, showing control plane components (API server, scheduler, controller manager, etcd) and worker nodes (kubelet, container runtime, pods, services).

    Kubernetes Architecture Deep Dive


    Control Plane Components


    1. API Server (kube-apiserver)
  • Role: Frontend for Kubernetes control plane
  • Functions:
  • Validates and processes REST API requests
  • Updates etcd with cluster state
  • Serves as authentication/authorization gateway
  • Production Best Practices:
  • Run 3-5 replicas for high availability
  • Enable audit logging (track all API calls)
  • Implement rate limiting to prevent abuse
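
    Audit logging from the list above is driven by a policy file passed to the API server through --audit-policy-file (with --audit-log-path for the output file). A minimal policy sketch, with rule choices that are illustrative rather than prescriptive:

```yaml
# Minimal audit policy (illustrative); reference it on kube-apiserver with
# --audit-policy-file=/etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Skip read-only requests to keep log volume manageable
  - level: None
    verbs: ["get", "list", "watch"]
  # Record who did what, when, and to which resource for everything else
  - level: Metadata
```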

    2. etcd
  • Role: Distributed key-value store (cluster's source of truth)
  • Stores: All cluster configuration, state, and metadata
  • Production Best Practices:
  • Deploy 5-node etcd cluster (tolerates 2 failures)
  • Use separate etcd cluster for large deployments (>100 nodes)
  • Enable encryption at rest for sensitive data
  • Automated backups every hour to S3/GCS

    Example: etcd Backup Script

```bash
#!/bin/bash
ETCDCTL_API=3 etcdctl snapshot save \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

# Upload to S3
aws s3 cp /backup/etcd-snapshot-*.db s3://k8s-backups/etcd/

# Retain only last 30 days
find /backup -name "etcd-snapshot-*.db" -mtime +30 -delete
```
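
    Encryption at rest, also listed in the best practices above, is configured on the API server rather than on etcd itself, via an EncryptionConfiguration file referenced by --encryption-provider-config. A minimal sketch (the key is a placeholder for a base64-encoded 32-byte secret):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      # New writes are encrypted with AES-CBC; identity stays last so
      # existing plaintext entries remain readable until rewritten
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}
```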


    3. Scheduler (kube-scheduler)
  • Role: Assigns pods to nodes based on resource availability
  • Factors Considered:
  • CPU/Memory requests and limits
  • Node affinity/anti-affinity rules
  • Taints and tolerations
  • Pod topology spread constraints
  • Production Optimization:
  • Custom schedulers for GPU workloads (NVIDIA GPU Operator)
  • Priority classes for critical workloads
  • Pod disruption budgets to prevent mass evictions
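
    To make the last two bullets concrete, a PriorityClass marks workloads the scheduler should favor, and a PodDisruptionBudget caps how many replicas voluntary evictions may remove. A sketch with illustrative names and thresholds:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-services      # illustrative name
value: 1000000                 # higher value = scheduled first, evicted last
globalDefault: false
description: "Revenue-critical workloads"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2              # voluntary evictions must leave at least 2 ready pods
  selector:
    matchLabels:
      app: web-app
```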

    4. Controller Manager (kube-controller-manager)
  • Role: Runs control loops to maintain desired state
  • Controllers:
  • ReplicaSet Controller: Ensures pod replica count
  • Deployment Controller: Manages rolling updates
  • Service Controller: Creates LoadBalancers in cloud providers
  • Node Controller: Monitors node health

    Worker Node Components


    1. Kubelet
  • Role: Agent running on each node
  • Functions:
  • Receives pod specs from API server
  • Ensures containers are running in pods
  • Reports node/pod status back to control plane
  • Production Configuration:
```yaml
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110                  # Default, adjust based on node size
podPidsLimit: 4096
containerLogMaxSize: "10Mi"
containerLogMaxFiles: 5
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
```


    2. Container Runtime
  • Options: containerd (default), CRI-O, Docker (deprecated)
  • containerd (Recommended):
  • Lightweight, CRI-native
  • Lower resource overhead vs Docker
  • Direct integration with Kubernetes

    3. kube-proxy
  • Role: Network proxy on each node
  • Functions: Implements Service abstraction (load balancing)
  • Modes:
  • iptables (default): Simple, works everywhere
  • IPVS: Better performance for large clusters (>1000 services)
  • eBPF (Cilium): Highest performance, advanced features
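
    Switching modes is a kube-proxy configuration change rather than a per-service setting. A sketch of the relevant fragment (for example in the kube-proxy ConfigMap on a kubeadm cluster):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"        # default is "iptables"
ipvs:
  scheduler: "rr"   # round-robin; other IPVS schedulers are available
```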

    Figure: Cluster components, from the control plane (API server, etcd, scheduler, controller manager) to worker nodes (kubelet, kube-proxy, container runtime, pods).

    Production Cluster Architecture Patterns


    Pattern 1: Multi-Tier Cluster


    Node Pools by Workload:

```yaml
# System workloads (kube-system, monitoring)
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role: system
    node-type: on-demand
spec:
  taints:
    - key: "dedicated"
      value: "system"
      effect: "NoSchedule"
---
# Stateful workloads (databases, caches)
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role: stateful
    storage: ssd
    node-type: on-demand   # Never use spot for stateful!
---
# Stateless apps (web servers, APIs)
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role: stateless
    node-type: spot        # Save 70% with spot instances
```


    Benefits:
  • Isolate critical system components
  • Optimize costs with spot instances for stateless workloads
  • Dedicated high-performance nodes for databases
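
    To place workloads on the right pool, a pod combines a nodeSelector (or node affinity) with a toleration for that pool's taint. A sketch for a monitoring component pinned to the system pool; the label and taint match the node manifests above, everything else is illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-collector            # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: metrics-collector
  template:
    metadata:
      labels:
        app: metrics-collector
    spec:
      nodeSelector:
        node-role: system            # only schedule onto the system pool
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "system"
          effect: "NoSchedule"       # tolerate the pool's taint
      containers:
        - name: collector
          image: metrics-collector:1.0   # placeholder image
```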

    Pattern 2: Multi-Cluster Strategy


    When to Use Multiple Clusters:
  1. Geographic Distribution: Low-latency for global users
  2. Environment Isolation: Separate dev/staging/prod clusters
  3. Compliance: Data residency requirements (GDPR)
  4. Blast Radius Reduction: Limit impact of cluster failures
  5. Multi-Tenancy: Hard isolation between teams/customers

    Cluster Federation Tools:
  • Karmada: Multi-cluster orchestration (CNCF project)
  • Rancher: Multi-cluster management UI
  • ArgoCD + ApplicationSets: GitOps across clusters
  • Istio Multi-Cluster: Service mesh spanning clusters
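
    For the ArgoCD + ApplicationSets option, a cluster generator stamps out one Application per cluster registered with ArgoCD. A sketch reusing the manifest repository from the GitOps section below; the path and naming are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: web-app-all-clusters
  namespace: argocd
spec:
  generators:
    - clusters: {}                   # one Application per registered cluster
  template:
    metadata:
      name: 'web-app-{{name}}'       # cluster name substituted by the generator
    spec:
      project: default
      source:
        repoURL: https://github.com/company/k8s-manifests
        targetRevision: main
        path: apps/web-app
      destination:
        server: '{{server}}'         # cluster API endpoint from the generator
        namespace: web-app
```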

    Pattern 3: Hybrid Cloud Architecture


    Example: AWS + On-Premises

```yaml
# Cluster on AWS EKS (public cloud)
apiVersion: v1
kind: Cluster
metadata:
  name: production-aws
spec:
  region: us-east-1
  nodeGroups:
    - name: web-tier
      instanceTypes: [t3.large, t3a.large]
      scaling:
        min: 5
        max: 50
---
# Cluster on-premises (sensitive data)
apiVersion: v1
kind: Cluster
metadata:
  name: production-onprem
spec:
  controlPlaneEndpoint: 10.0.0.10:6443
  workloads:
    - databases            # Keep data on-premises for compliance
    - payment-processing
```


    Cross-Cluster Communication:
  • Submariner: Direct pod-to-pod networking across clusters
  • Cloud VPN: Site-to-site VPN (AWS-to-on-prem)
  • Service Mesh: Istio multi-cluster mode

    Figure: Multi-cluster architecture with a primary cluster in AWS, a secondary cluster in GCP, and an on-premises cluster joined by a service mesh for federated workload distribution.

    CI/CD Pipelines for Kubernetes


    GitOps Workflow


    Philosophy: Git as the single source of truth for infrastructure and applications

    Architecture:
  1. Developers commit code to Git
  2. CI pipeline builds Docker image, pushes to registry
  3. Automated tool updates Kubernetes manifests in Git
  4. GitOps operator (ArgoCD/Flux) detects changes and applies to cluster
  5. Kubernetes converges to desired state

    Example: ArgoCD Application

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/k8s-manifests
    targetRevision: main
    path: apps/web-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: web-app
  syncPolicy:
    automated:
      prune: true       # Delete resources not in Git
      selfHeal: true    # Revert manual changes
    syncOptions:
      - CreateNamespace=true
```


    Benefits of GitOps:
  • Audit Trail: Every change tracked in Git history
  • Rollback: `git revert` to undo deployment
  • Disaster Recovery: Re-create entire cluster from Git
  • Multi-Environment: Separate branches/directories for dev/staging/prod

    Complete CI/CD Pipeline


    GitHub Actions Example:

```yaml
name: Deploy to Kubernetes

on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Build Docker image
        run: |
          docker build -t myapp:${{ github.sha }} .
          docker tag myapp:${{ github.sha }} myapp:latest

      - name: Run security scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          severity: 'CRITICAL,HIGH'

      - name: Push to Docker registry
        run: |
          echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin
          docker push myapp:${{ github.sha }}
          docker push myapp:latest

      - name: Update Kubernetes manifests
        run: |
          cd k8s-manifests
          kustomize edit set image myapp=myapp:${{ github.sha }}
          git add .
          git commit -m "Update image to ${{ github.sha }}"
          git push

      # ArgoCD automatically detects the Git change and deploys
```


    Jenkins Pipeline (Alternative):

```groovy
pipeline {
    agent { label 'docker' }

    stages {
        stage('Build') {
            steps {
                sh 'docker build -t myapp:${BUILD_NUMBER} .'
            }
        }

        stage('Test') {
            steps {
                sh 'docker run myapp:${BUILD_NUMBER} npm test'
            }
        }

        stage('Security Scan') {
            steps {
                sh 'trivy image myapp:${BUILD_NUMBER}'
            }
        }

        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                withKubeConfig([credentialsId: 'k8s-prod']) {
                    sh '''
                        kubectl set image deployment/myapp \
                            myapp=myapp:${BUILD_NUMBER} \
                            --namespace=production

                        kubectl rollout status deployment/myapp \
                            --namespace=production \
                            --timeout=5m
                    '''
                }
            }
        }
    }

    post {
        failure {
            slackSend(channel: '#deployments', color: 'danger',
                message: "Deployment failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}")
        }
    }
}
```


    Figure: GitOps CI/CD pipeline, from Git commit through Docker build, image scan, and registry push to ArgoCD sync and Kubernetes deployment with automated rollback.

    Auto-Scaling Strategies


    1. Horizontal Pod Autoscaler (HPA)


    Scale based on metrics:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5min before scaling down
      policies:
        - type: Percent
          value: 50                     # Scale down max 50% of pods at once
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
        - type: Percent
          value: 100                    # Double pods if needed
          periodSeconds: 15
```


    Custom Metrics with Prometheus:

```yaml
# Install Prometheus Adapter first
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_length        # AWS SQS queue depth
          selector:
            matchLabels:
              queue_name: payments
        target:
          type: AverageValue
          averageValue: "30"            # Scale if queue > 30 msgs/pod
```


    2. Vertical Pod Autoscaler (VPA)


    Automatically adjust CPU/memory requests:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: db-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Auto"        # Recreate pods with new resources
  resourcePolicy:
    containerPolicies:
      - containerName: postgres
        minAllowed:
          cpu: 1
          memory: 2Gi
        maxAllowed:
          cpu: 8
          memory: 32Gi
        controlledResources: ["cpu", "memory"]
```


    Use Cases:
  • Databases with variable workloads
  • Batch processing jobs
  • Prevent over/under-provisioning

    3. Cluster Autoscaler


    Add/remove nodes based on pending pods:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  template:
    spec:
      containers:
        - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.28.0
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --namespace=kube-system
            - --nodes=3:20:worker-pool-1      # min:max:nodegroup
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --skip-nodes-with-local-storage=false
          env:
            - name: AWS_REGION
              value: us-east-1
```


    Best Practices:
  • Set realistic min/max node counts
  • Use pod disruption budgets so scale-down cannot evict too many pods at once
  • Monitor cluster autoscaler logs for scaling events
  • Combine with HPA for complete auto-scaling

    4. KEDA (Kubernetes Event-Driven Autoscaling)


    Scale based on external events:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: event-processor
  minReplicaCount: 1
  maxReplicaCount: 100
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: my-group
        topic: events
        lagThreshold: "50"          # Scale if lag > 50 messages
    - type: cron                    # Scale up before traffic spike
      metadata:
        timezone: America/New_York
        start: 0 8 * * *            # 8 AM daily
        end: 0 18 * * *             # 6 PM daily
        desiredReplicas: "20"
```


    Figure: The auto-scaling layers, HPA for pod scaling, VPA for resource right-sizing, Cluster Autoscaler for node management, and KEDA for event-driven scaling.

    Monitoring & Observability


    The Three Pillars


    1. Metrics (Prometheus + Grafana)

    Install Prometheus Stack:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.adminPassword=SecurePassword123
```


    Key Metrics to Monitor:
  • Cluster Health: Node CPU/memory, disk usage
  • Pod Health: Restart count, pod status (Pending, CrashLoopBackOff)
  • Application: Request rate, error rate, duration (RED method)
  • Resource Saturation: CPU throttling, OOM kills
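
    The RED items above map directly onto PromQL. A sketch of recording rules using the Prometheus Operator's PrometheusRule CRD; the http_* metric names assume typical HTTP server instrumentation and are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: red-recording-rules
spec:
  groups:
    - name: red-method
      rules:
        - record: service:http_requests:rate5m            # Rate
          expr: sum(rate(http_requests_total[5m])) by (service)
        - record: service:http_errors:ratio_rate5m        # Errors
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
        - record: service:http_request_duration:p95_5m    # Duration
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
```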

    Example: Custom Prometheus Alert

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
spec:
  groups:
    - name: application
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 5m
          annotations:
            summary: "High error rate detected: {{ $labels.service }}"
          labels:
            severity: critical

        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 15m
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} crash looping"
```


    2. Logs (ELK / Loki)

    Loki (Lightweight, integrates with Grafana):

```bash
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set promtail.enabled=true
```


    Log Aggregation Pattern:
  • Promtail: Collects logs from all pods
  • Loki: Stores and indexes logs
  • Grafana: Query and visualize logs

    Example: Query Pod Logs in Grafana

```logql
{namespace="production", app="web-app"} |= "error" | json | level="error"
```


    3. Traces (Jaeger / Tempo)

    Distributed Tracing for Microservices:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  template:
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_ZIPKIN_HTTP_PORT
              value: "9411"
          ports:
            - containerPort: 16686   # Jaeger UI
            - containerPort: 14268   # Collector
```


    Instrument App with OpenTelemetry:

```python
from flask import Flask

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = Flask(__name__)

# Configure tracer
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Trace function (db and send_confirmation_email are application helpers, not shown)
@app.route("/api/order")
def create_order():
    with tracer.start_as_current_span("create_order"):
        # Business logic
        with tracer.start_as_current_span("db_query"):
            order = db.insert_order()

        with tracer.start_as_current_span("send_email"):
            send_confirmation_email(order)

        return order
```


    Figure: Observability stack, Prometheus for metrics, Loki for logs, and Jaeger for traces, unified in Grafana dashboards.

    Security Hardening


    1. Network Policies


    Deny all traffic by default:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```


    Allow specific traffic:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app-policy
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: nginx-ingress
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:                      # Allow DNS
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
```


    2. Pod Security Standards


    Enforce restricted pod security:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```


    Secure Pod Spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:1.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        limits:
          cpu: "1"
          memory: "512Mi"
        requests:
          cpu: "100m"
          memory: "128Mi"
```


    3. Secrets Management


    External Secrets Operator (AWS Secrets Manager):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets
  target:
    name: db-secret
  data:
    - secretKey: password
      remoteRef:
        key: prod/db/postgres
        property: password
```


    Sealed Secrets (GitOps-Friendly):

```bash
# Encrypt secret before committing to Git
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# Commit sealed-secret.yaml (safe to store in Git)
git add sealed-secret.yaml
git commit -m "Add encrypted database credentials"
```


    4. RBAC (Role-Based Access Control)


    Principle of Least Privilege:

```yaml
# Role for developers (read-only)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: developer
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/log", "deployments", "jobs"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/portforward"]
    verbs: ["create"]   # Allow debugging
---
# Bind role to users
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: development
subjects:
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```


    Figure: Security layers, network policies, pod security standards, RBAC, secrets management, and Falco runtime threat detection.

    Cost Optimization


    1. Right-Sizing Resources


    Problem: Over-provisioning wastes 30-50% of cloud costs

    Solution: Goldilocks (VPA Recommendations)

```bash
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace

# Enable for namespace
kubectl label namespace production goldilocks.fairwinds.com/enabled=true

# View recommendations
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
# Open http://localhost:8080
```


    2. Spot Instances for Stateless Workloads


    AWS Spot Instance Node Group:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
nodeGroups:
  - name: stateless-spot
    instancesDistribution:
      instanceTypes: [t3.large, t3a.large, t2.large]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0   # 100% spot
      spotAllocationStrategy: capacity-optimized
    minSize: 5
    maxSize: 100
    labels:
      workload-type: stateless
    taints:
      - key: "spot"
        value: "true"
        effect: "NoSchedule"
```


    Pod Tolerations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    spec:
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload-type
                    operator: In
                    values: ["stateless"]
```


    3. Cluster Resource Usage Monitoring


    Kubecost (Cost Allocation):

```bash
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set prometheus.server.global.external_labels.cluster_id=production

# View dashboard
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
```


    Insights Provided:
  • Cost per namespace/deployment/pod
  • Idle resource recommendations
  • Rightsizing suggestions
  • Spot vs on-demand cost breakdown

    4. Storage Optimization


    Use appropriate storage classes:

```yaml
# Fast SSD for databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3          # Latest generation SSD
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
---
# Cheap HDD for logs/backups
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow-hdd
provisioner: ebs.csi.aws.com
parameters:
  type: sc1          # Cold HDD (cheapest)
volumeBindingMode: WaitForFirstConsumer
```


    Lifecycle Policies:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: logs-archive
  annotations:
    # Move to Glacier after 30 days
    aws-ebs-csi-driver/lifecycle-policy: glacier-30d
spec:
  storageClassName: slow-hdd
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Ti
```


    Figure: Cost optimization dashboard with resource utilization, spot instance savings, right-sizing recommendations, and storage tier distribution alongside monthly cost trends.

    Disaster Recovery & Business Continuity


    Backup Strategies


    Velero (Cluster Backup):

```bash
# Install Velero
velero install \
  --provider aws \
  --bucket k8s-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Schedule daily backups
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 720h   # Retain for 30 days

# Backup specific namespace
velero backup create prod-backup --include-namespaces production

# Restore from backup
velero restore create --from-backup prod-backup
```


    What Velero Backs Up:
  • All Kubernetes resources (Deployments, Services, ConfigMaps)
  • Persistent Volume data (via snapshot APIs)
  • Namespace configuration
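
    The daily schedule created above with the CLI can also live in Git as a Velero Schedule resource, which fits the GitOps workflow; this sketch mirrors the same cron expression, namespace, and retention:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"                # 2 AM daily
  template:
    includedNamespaces: ["production"]
    ttl: 720h0m0s                      # retain for 30 days
```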

    Multi-Region Failover


    Active-Passive Setup:

```yaml
# Primary cluster (us-east-1)
apiVersion: v1
kind: Service
metadata:
  name: app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com
spec:
  type: LoadBalancer
  selector:
    app: web-app

# Secondary cluster (us-west-2) - standby
# Use Route53 health checks to failover
```


    Active-Active (Multi-Region):
  • Database Replication: PostgreSQL streaming replication, MySQL Group Replication
  • Object Storage: S3 cross-region replication
  • Traffic Management: AWS Route53 geolocation routing, Cloudflare Load Balancing
  • Data Consistency: Eventual consistency, conflict resolution strategies

    Production Checklist


    Pre-Deployment

  • [ ] Resource limits defined for all pods
  • [ ] Liveness and readiness probes configured
  • [ ] Pod disruption budgets set
  • [ ] Network policies enforced
  • [ ] Secrets stored in external secret manager
  • [ ] RBAC roles follow least privilege
  • [ ] Container images scanned for vulnerabilities (Trivy, Snyk)
  • [ ] Horizontal Pod Autoscaler configured
  • [ ] Monitoring and alerting set up (Prometheus, Grafana)
  • [ ] Logging aggregation configured (Loki, ELK)
  • [ ] Backup strategy tested (Velero)
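
    Two of the items above, resource limits and liveness/readiness probes, are the ones most often missed. A minimal container spec sketch covering both; the endpoints, port, and thresholds are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: myapp:1.0
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: "1", memory: 512Mi }
          readinessProbe:            # removes the pod from Service endpoints while failing
            httpGet: { path: /healthz/ready, port: 8080 }
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:             # restarts the container when failing
            httpGet: { path: /healthz/live, port: 8080 }
            initialDelaySeconds: 15
            periodSeconds: 20
```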

    Post-Deployment

  • [ ] Verify all pods running (`kubectl get pods`)
  • [ ] Check autoscaling behavior under load
  • [ ] Test disaster recovery (restore from backup)
  • [ ] Validate security policies (network isolation, RBAC)
  • [ ] Run chaos engineering tests (Chaos Mesh, Litmus)
  • [ ] Document runbooks for common incidents
  • [ ] Conduct post-deployment review

    Conclusion: Kubernetes Mastery Path


    Kubernetes production excellence requires continuous learning:


    Months 1-3: Foundations

  • Complete CKAD/CKA certifications
  • Deploy personal projects to managed Kubernetes (GKE, EKS)
  • Master kubectl, YAML manifests, Helm charts

    Months 4-6: Production Patterns

  • Implement GitOps with ArgoCD
  • Set up comprehensive monitoring (Prometheus, Grafana, Loki)
  • Practice incident response and debugging

    Months 7-12: Advanced Topics

  • Multi-cluster management
  • Service mesh (Istio, Linkerd)
  • Advanced security (OPA/Gatekeeper policies, Falco runtime security)
  • Cost optimization strategies

    Year 2+: Expertise

  • Contribute to CNCF projects
  • Run chaos engineering experiments
  • Architect multi-cloud Kubernetes platforms
  • Pursue CKS (Certified Kubernetes Security Specialist)

    Remember: Kubernetes is a journey, not a destination. The ecosystem evolves rapidly, so stay curious, keep learning, and always test in staging before production.


    FAQ — People Also Ask