Kubernetes in Production: Complete DevOps Guide to Deployment & Scaling (2025)
Complete guide to production-grade Kubernetes deployments. Learn cluster architecture, CI/CD automation, monitoring, security hardening, and cost optimization strategies.

Kubernetes in Production: The Complete DevOps Playbook
This guide covers everything needed to run Kubernetes reliably at scale—from cluster design to disaster recovery.
Why Kubernetes Won the Container War
The Pre-Kubernetes Era
Kubernetes Advantages
Kubernetes Architecture Deep Dive
Control Plane Components
```bash
#!/bin/bash
ETCDCTL_API=3 etcdctl snapshot save \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

# Upload to S3
aws s3 cp /backup/etcd-snapshot-*.db s3://k8s-backups/etcd/

# Retain only last 30 days
find /backup -name "etcd-snapshot-*.db" -mtime +30 -delete
```
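A backup is only as good as the last rehearsed restore. A minimal restore sketch, assuming a snapshot produced by the script above (the filename and data directory are illustrative):
```bash
# Restore the snapshot into a fresh data directory (paths are illustrative)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20250101-020000.db \
  --data-dir=/var/lib/etcd-restore

# Then point the etcd static pod at the restored directory
# (edit /etc/kubernetes/manifests/etcd.yaml) and let the kubelet restart it.
```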
Worker Node Components
```yaml
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110                 # Default; adjust based on node size
podPidsLimit: 4096
containerLogMaxSize: "10Mi"
containerLogMaxFiles: 5
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
```
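After editing the file, restart the kubelet and confirm the node picked up the new values. A quick check via the node's configz endpoint, assuming the node name equals the machine's hostname:
```bash
# Restart the kubelet so it reloads the config file
sudo systemctl restart kubelet

# Verify the running configuration (node name assumed to equal hostname)
kubectl get --raw "/api/v1/nodes/$(hostname)/proxy/configz" | jq .kubeletconfig.maxPods
```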
Production Cluster Architecture Patterns
Pattern 1: Multi-Tier Cluster
```yaml
# System workloads (kube-system, monitoring)
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role: system
    node-type: on-demand
spec:
  taints:
    - key: "node-role"       # taint key chosen to match the label above
      value: "system"
      effect: "NoSchedule"
---
# Stateful workloads (databases, caches)
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role: stateful
    storage: ssd
    node-type: on-demand     # Never use spot for stateful!
---
# Stateless apps (web servers, APIs)
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role: stateless
    node-type: spot          # Save ~70% with spot instances
```
Pattern 2: Multi-Cluster Strategy
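A common way to implement this pattern is to let GitOps fan the same manifests out to every cluster. A sketch using Argo CD's ApplicationSet cluster generator, assuming each production cluster is already registered with Argo CD (the repo URL matches the GitOps example later in this guide):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: web-app-all-clusters
  namespace: argocd
spec:
  generators:
    - clusters: {}                 # one Application per registered cluster
  template:
    metadata:
      name: 'web-app-{{name}}'     # {{name}} / {{server}} come from the generator
    spec:
      project: default
      source:
        repoURL: https://github.com/company/k8s-manifests
        targetRevision: main
        path: apps/web-app/overlays/production
      destination:
        server: '{{server}}'
        namespace: web-app
```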
Pattern 3: Hybrid Cloud Architecture
```yaml
# Illustrative cluster definitions (the exact schema depends on your
# provisioning tool, e.g. Cluster API or a Terraform module)

# Cluster on AWS EKS (public cloud)
apiVersion: v1
kind: Cluster
metadata:
  name: production-aws
spec:
  region: us-east-1
  nodeGroups:
    - instanceTypes: [t3.large, t3a.large]
      scaling:
        min: 5
        max: 50
---
# Cluster on-premises (sensitive data)
apiVersion: v1
kind: Cluster
metadata:
  name: production-onprem
spec:
  controlPlaneEndpoint: 10.0.0.10:6443
  workloads: []   # sensitive workloads pinned here (entries omitted)
```
CI/CD Pipelines for Kubernetes
GitOps Workflow
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/k8s-manifests
    targetRevision: main
    path: apps/web-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: web-app
  syncPolicy:
    automated:
      prune: true      # Delete resources not in Git
      selfHeal: true   # Revert manual changes
    syncOptions:
      - CreateNamespace=true   # Create the target namespace if it doesn't exist
```
Complete CI/CD Pipeline
```yaml
# .github/workflows/deploy.yml
name: Deploy to Kubernetes
on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Build image
        run: |
          docker build -t myapp:${{ github.sha }} .
          docker tag myapp:${{ github.sha }} myapp:latest

      - name: Scan image for vulnerabilities
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          severity: 'CRITICAL,HIGH'

      - name: Push image
        run: |
          echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin
          docker push myapp:${{ github.sha }}
          docker push myapp:latest

      - name: Update manifests
        run: |
          cd k8s-manifests
          kustomize edit set image myapp=myapp:${{ github.sha }}
          git add .
          git commit -m "Update image to ${{ github.sha }}"
          git push

      # ArgoCD automatically detects the Git change and deploys
```
The same pipeline in Jenkins, deploying directly with kubectl rather than through GitOps:
```groovy
pipeline {
    agent { label 'docker' }
    stages {
        stage('Build') {
            steps {
                sh 'docker build -t myapp:${BUILD_NUMBER} .'
            }
        }
        stage('Test') {
            steps {
                sh 'docker run myapp:${BUILD_NUMBER} npm test'
            }
        }
        stage('Security Scan') {
            steps {
                sh 'trivy image myapp:${BUILD_NUMBER}'
            }
        }
        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                withKubeConfig([credentialsId: 'k8s-prod']) {
                    sh '''
                        kubectl set image deployment/myapp \
                            myapp=myapp:${BUILD_NUMBER} \
                            --namespace=production
                        kubectl rollout status deployment/myapp \
                            --namespace=production \
                            --timeout=5m
                    '''
                }
            }
        }
    }
    post {
        failure {
            slackSend(channel: '#deployments', color: 'danger',
                message: "Deployment failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}")
        }
    }
}
```
Auto-Scaling Strategies
1. Horizontal Pod Autoscaler (HPA)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 50                     # Remove at most 50% of pods at once
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
        - type: Percent
          value: 100                    # Double pods if needed
          periodSeconds: 15
```
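Apply it and watch the controller's decisions in real time (the manifest filename is illustrative):
```bash
kubectl apply -f web-app-hpa.yaml
kubectl get hpa web-app-hpa --watch   # current vs. target metrics and replica count
```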
```yaml
# Install the Prometheus Adapter (or an equivalent metrics adapter) first
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_length   # AWS SQS queue depth
          selector:
            matchLabels:
              queue_name: payments
        target:
          type: AverageValue
          averageValue: "30"       # Scale out if queue depth exceeds 30 messages per pod
```
2. Vertical Pod Autoscaler (VPA)
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: db-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Auto"   # Recreate pods with new resources
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 1
          memory: 2Gi
        maxAllowed:
          cpu: 8
          memory: 32Gi
        controlledResources: ["cpu", "memory"]
```
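VPA publishes its sizing recommendations on the object itself, so you can review them before trusting "Auto" mode (assuming the VPA recommender is running):
```bash
kubectl describe vpa db-vpa
# Look for the "Recommendation" section: target, lowerBound and upperBound per container
```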
3. Cluster Autoscaler
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - name: cluster-autoscaler
          # Pin the image to your cluster's minor version
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
          command:
            # Illustrative flags; see the provider docs for the full set
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --nodes=5:100:production-workers   # min:max:ASG-name
          env:
            - name: AWS_REGION
              value: us-east-1
```
4. KEDA (Kubernetes Event-Driven Autoscaling)
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: event-processor
  minReplicaCount: 1
  maxReplicaCount: 100
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: my-group
        topic: events
        lagThreshold: "50"       # Scale out if consumer lag > 50 messages
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 8 * * *         # 8 AM daily
        end: 0 18 * * *          # 6 PM daily
        desiredReplicas: "20"
```
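KEDA materializes each ScaledObject as a regular HPA under the hood, so both views are worth checking:
```bash
kubectl get scaledobject kafka-consumer-scaler   # READY / ACTIVE status
kubectl get hpa                                  # KEDA creates an HPA named keda-hpa-<scaledobject>
```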
Monitoring & Observability
The Three Pillars: Metrics, Logs, and Traces
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set grafana.adminPassword=SecurePassword123
```
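Out of the box the stack scrapes cluster components; your own services need a ServiceMonitor. A minimal sketch, assuming the app's Service carries the label app: web-app and exposes a port named http serving /metrics:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the Helm release for Prometheus to pick it up
spec:
  selector:
    matchLabels:
      app: web-app
  namespaceSelector:
    matchNames: [production]
  endpoints:
    - port: http       # named port on the Service
      path: /metrics
      interval: 30s
```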
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
spec:
  groups:
    - name: app.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected: {{ $labels.service }}"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 15m
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} crash looping"
```
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set promtail.enabled=true
```
```logql
{namespace="production", app="web-app"} |= "error" | json | level="error"
```
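The same pipeline syntax supports aggregations, e.g. the error rate per app over the last five minutes:
```logql
sum by (app) (rate({namespace="production"} |= "error" [5m]))
```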
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT   # enable the Zipkin-compatible endpoint
              value: "9411"
          ports:
            - containerPort: 16686   # query UI
            - containerPort: 6831    # agent (UDP)
```
```python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracer
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)

# Trace a request handler (Flask app, db client and mailer assumed to exist)
@app.route("/api/order")
def create_order():
    with tracer.start_as_current_span("create_order"):
        # Business logic
        with tracer.start_as_current_span("db_query"):
            order = db.insert_order()
        with tracer.start_as_current_span("send_email"):
            send_confirmation_email(order)
        return order
```
Security Hardening
1. Network Policies
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}          # Applies to every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app-policy
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: nginx-ingress
      ports:
        - port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system   # Allow DNS lookups
      ports:
        - port: 53
          protocol: UDP
```
2. Pod Security Standards
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
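Before enforcing on a namespace that already has workloads, a server-side dry run reports which running pods would be rejected:
```bash
kubectl label --dry-run=server --overwrite ns production \
  pod-security.kubernetes.io/enforce=restricted
# Warnings list every pod that would violate the "restricted" profile
```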
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:1.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        limits:
          cpu: "1"
          memory: "512Mi"
        requests:
          cpu: "100m"
          memory: "128Mi"
```
3. Secrets Management
```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets
    kind: SecretStore
  target:
    name: db-secret
  data:
    - secretKey: password
      remoteRef:
        key: prod/db/postgres
        property: password
```
```bash
# Encrypt the secret before committing to Git
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# Commit sealed-secret.yaml (safe to store in Git)
git add sealed-secret.yaml
git commit -m "Add encrypted database credentials"
```
4. RBAC (Role-Based Access Control)
```yaml
# Role for developers (read-only, plus debugging)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: developer
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/log", "deployments", "jobs"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/portforward"]
    verbs: ["create"]   # Allow debugging
---
# Bind the role to the developers group
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: development
subjects:
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```
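Verify the binding behaves as intended by impersonating the group (requires impersonation rights; the username is illustrative):
```bash
kubectl auth can-i list pods --namespace development \
  --as jane --as-group developers          # expect "yes"
kubectl auth can-i delete deployments --namespace development \
  --as jane --as-group developers          # expect "no"
```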
Cost Optimization
1. Right-Sizing Resources
```bash
# Install Goldilocks (VPA-based right-sizing recommendations)
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace

# Enable for a namespace
kubectl label namespace production goldilocks.fairwinds.com/enabled=true

# View recommendations
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
# Open http://localhost:8080
```
2. Spot Instances for Stateless Workloads
```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
nodeGroups:
  - name: spot-stateless
    instancesDistribution:
      instanceTypes: [t3.large, t3a.large, t2.large]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0   # 100% spot
      spotAllocationStrategy: capacity-optimized
    minSize: 5
    maxSize: 100
    labels:
      workload-type: stateless
    taints:
      - key: "spot"
        value: "true"
        effect: "NoSchedule"
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    spec:
      tolerations:
        - key: "spot"              # matches the node group taint above
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload-type
                    operator: In
                    values: ["stateless"]
```
3. Cluster Resource Usage Monitoring
```bash
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set prometheus.server.global.external_labels.cluster_id=production

# View dashboard
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
```
4. Storage Optimization
```yaml
# Fast SSD for databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3          # Latest-generation general-purpose SSD
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
---
# Cheap HDD for logs/backups
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow-hdd
provisioner: ebs.csi.aws.com
parameters:
  type: sc1          # Cold HDD (cheapest)
volumeBindingMode: WaitForFirstConsumer
```
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: logs-archive
  annotations:
    # Move to Glacier after 30 days (illustrative annotation; actual tiering
    # happens outside the EBS CSI driver, e.g. via an S3 lifecycle rule on
    # exported logs)
    aws-ebs-csi-driver/lifecycle-policy: glacier-30d
spec:
  storageClassName: slow-hdd
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Ti
```
Disaster Recovery & Business Continuity
Backup Strategies
```bash
# Install Velero
velero install \
  --provider aws \
  --bucket k8s-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Schedule daily backups (retain for 30 days)
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 720h

# Back up a specific namespace
velero backup create prod-backup --include-namespaces production

# Restore from backup
velero restore create --from-backup prod-backup
```
Multi-Region Failover
```yaml
# Primary cluster (us-east-1)
apiVersion: v1
kind: Service
metadata:
  name: app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com
spec:
  type: LoadBalancer
  selector:
    app: web-app

# Secondary cluster (us-west-2) runs the same Service on standby
# Use Route53 health checks to fail over
```
Production Checklist
Pre-Deployment
Post-Deployment
Conclusion: Kubernetes Mastery Path
Kubernetes production excellence requires continuous learning:
Months 1-3: Foundations
Months 4-6: Production Patterns
Months 7-12: Advanced Topics
Year 2+: Expertise
Remember: Kubernetes is a journey, not a destination. The ecosystem evolves rapidly—stay curious, keep learning, and always test in staging before production.