Cloud & DevOps

Kubernetes in Production: Complete DevOps Guide to Deployment & Scaling (2025)

Complete guide to production-grade Kubernetes deployments. Learn cluster architecture, CI/CD automation, monitoring, security hardening, and cost optimization strategies.

TEELI Team
DevOps & Cloud Engineering Specialists
Jan 15, 2025
13 min read

Kubernetes in Production: The Complete DevOps Playbook



This guide covers everything needed to run Kubernetes reliably at scale—from cluster design to disaster recovery.



Why Kubernetes Won the Container War


The Pre-Kubernetes Era


2013-2015: Container Chaos
  • Manual container deployments with Docker CLI
  • Shell scripts for orchestration (fragile, non-portable)
  • No standardized networking or service discovery
  • Difficult to scale across multiple hosts

  • Early Orchestrators:
  • Docker Swarm: Simple but limited scalability
  • Apache Mesos: Powerful but complex to operate
  • AWS ECS: Proprietary, vendor lock-in

Kubernetes Advantages


    Real-World Impact:
  • Spotify: Reduced deployment time from hours to 15 minutes
  • Pinterest: 80% reduction in infrastructure costs
  • Airbnb: Scaled from 100 to 1,000 microservices seamlessly

    Figure: Kubernetes architecture, showing control plane components (API server, scheduler, controller manager, etcd) and worker nodes (kubelet, container runtime, pods, services).

    Kubernetes Architecture Deep Dive


    Control Plane Components


    1. API Server (kube-apiserver)
  • Role: Frontend for Kubernetes control plane
  • Functions:
  • Validates and processes REST API requests
  • Updates etcd with cluster state
  • Serves as authentication/authorization gateway
  • Production Best Practices:
  • Run 3-5 replicas for high availability
  • Enable audit logging (track all API calls)
  • Implement rate limiting to prevent abuse
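
    Audit logging from the list above is driven by a policy file passed to the API server through --audit-policy-file (with --audit-log-path for the output file). A minimal policy sketch, with rule choices that are illustrative rather than prescriptive:

```yaml
# Minimal audit policy (illustrative); reference it on kube-apiserver with
# --audit-policy-file=/etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Skip read-only requests to keep log volume manageable
  - level: None
    verbs: ["get", "list", "watch"]
  # Record who did what, when, and to which resource for everything else
  - level: Metadata
```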

    2. etcd
  • Role: Distributed key-value store (cluster's source of truth)
  • Stores: All cluster configuration, state, and metadata
  • Production Best Practices:
  • Deploy 5-node etcd cluster (tolerates 2 failures)
  • Use separate etcd cluster for large deployments (>100 nodes)
  • Enable encryption at rest for sensitive data
  • Automated backups every hour to S3/GCS

    Example: etcd Backup Script

```bash
#!/bin/bash
ETCDCTL_API=3 etcdctl snapshot save \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

# Upload to S3
aws s3 cp /backup/etcd-snapshot-*.db s3://k8s-backups/etcd/

# Retain only last 30 days
find /backup -name "etcd-snapshot-*.db" -mtime +30 -delete
```
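
    Encryption at rest, also listed in the best practices above, is configured on the API server rather than on etcd itself, via an EncryptionConfiguration file referenced by --encryption-provider-config. A minimal sketch (the key is a placeholder for a base64-encoded 32-byte secret):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      # New writes are encrypted with AES-CBC; identity stays last so
      # existing plaintext entries remain readable until rewritten
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}
```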


    3. Scheduler (kube-scheduler)
  • Role: Assigns pods to nodes based on resource availability
  • Factors Considered:
  • CPU/Memory requests and limits
  • Node affinity/anti-affinity rules
  • Taints and tolerations
  • Pod topology spread constraints
  • Production Optimization:
  • Custom schedulers for GPU workloads (NVIDIA GPU Operator)
  • Priority classes for critical workloads
  • Pod disruption budgets to prevent mass evictions
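
    To make the last two bullets concrete, a PriorityClass marks workloads the scheduler should favor, and a PodDisruptionBudget caps how many replicas voluntary evictions may remove. A sketch with illustrative names and thresholds:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-services      # illustrative name
value: 1000000                 # higher value = scheduled first, evicted last
globalDefault: false
description: "Revenue-critical workloads"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2              # voluntary evictions must leave at least 2 ready pods
  selector:
    matchLabels:
      app: web-app
```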

    4. Controller Manager (kube-controller-manager)
  • Role: Runs control loops to maintain desired state
  • Controllers:
  • ReplicaSet Controller: Ensures pod replica count
  • Deployment Controller: Manages rolling updates
  • Service Controller: Creates LoadBalancers in cloud providers
  • Node Controller: Monitors node health

    Worker Node Components


    1. Kubelet
  • Role: Agent running on each node
  • Functions:
  • Receives pod specs from API server
  • Ensures containers are running in pods
  • Reports node/pod status back to control plane
  • Production Configuration:
```yaml
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110                  # Default, adjust based on node size
podPidsLimit: 4096
containerLogMaxSize: "10Mi"
containerLogMaxFiles: 5
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
```


    2. Container Runtime
  • Options: containerd (default), CRI-O, Docker (deprecated)
  • containerd (Recommended):
  • Lightweight, CRI-native
  • Lower resource overhead vs Docker
  • Direct integration with Kubernetes

    3. kube-proxy
  • Role: Network proxy on each node
  • Functions: Implements Service abstraction (load balancing)
  • Modes:
  • iptables (default): Simple, works everywhere
  • IPVS: Better performance for large clusters (>1000 services)
  • eBPF (Cilium): Highest performance, advanced features
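
    Switching modes is a kube-proxy configuration change rather than a per-service setting. A sketch of the relevant fragment (for example in the kube-proxy ConfigMap on a kubeadm cluster):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"        # default is "iptables"
ipvs:
  scheduler: "rr"   # round-robin; other IPVS schedulers are available
```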

    Figure: Cluster components, from the control plane (API server, etcd, scheduler, controller manager) to worker nodes (kubelet, kube-proxy, container runtime, pods).

    Production Cluster Architecture Patterns


    Pattern 1: Multi-Tier Cluster


    Node Pools by Workload:

```yaml
# System workloads (kube-system, monitoring)
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role: system
    node-type: on-demand
spec:
  taints:
    - key: "dedicated"
      value: "system"
      effect: "NoSchedule"
---
# Stateful workloads (databases, caches)
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role: stateful
    storage: ssd
    node-type: on-demand   # Never use spot for stateful!
---
# Stateless apps (web servers, APIs)
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role: stateless
    node-type: spot        # Save 70% with spot instances
```


    Benefits:
  • Isolate critical system components
  • Optimize costs with spot instances for stateless workloads
  • Dedicated high-performance nodes for databases
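
    To place workloads on the right pool, a pod combines a nodeSelector (or node affinity) with a toleration for that pool's taint. A sketch for a monitoring component pinned to the system pool; the label and taint match the node manifests above, everything else is illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-collector            # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: metrics-collector
  template:
    metadata:
      labels:
        app: metrics-collector
    spec:
      nodeSelector:
        node-role: system            # only schedule onto the system pool
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "system"
          effect: "NoSchedule"       # tolerate the pool's taint
      containers:
        - name: collector
          image: metrics-collector:1.0   # placeholder image
```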

    Pattern 2: Multi-Cluster Strategy


    When to Use Multiple Clusters:
  1. Geographic Distribution: Low-latency for global users
  2. Environment Isolation: Separate dev/staging/prod clusters
  3. Compliance: Data residency requirements (GDPR)
  4. Blast Radius Reduction: Limit impact of cluster failures
  5. Multi-Tenancy: Hard isolation between teams/customers

    Cluster Federation Tools:
  • Karmada: Multi-cluster orchestration (CNCF project)
  • Rancher: Multi-cluster management UI
  • ArgoCD + ApplicationSets: GitOps across clusters
  • Istio Multi-Cluster: Service mesh spanning clusters
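
    For the ArgoCD + ApplicationSets option, a cluster generator stamps out one Application per cluster registered with ArgoCD. A sketch reusing the manifest repository from the GitOps section below; the path and naming are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: web-app-all-clusters
  namespace: argocd
spec:
  generators:
    - clusters: {}                   # one Application per registered cluster
  template:
    metadata:
      name: 'web-app-{{name}}'       # cluster name substituted by the generator
    spec:
      project: default
      source:
        repoURL: https://github.com/company/k8s-manifests
        targetRevision: main
        path: apps/web-app
      destination:
        server: '{{server}}'         # cluster API endpoint from the generator
        namespace: web-app
```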

    Pattern 3: Hybrid Cloud Architecture


    Example: AWS + On-Premises

```yaml
# Cluster on AWS EKS (public cloud)
apiVersion: v1
kind: Cluster
metadata:
  name: production-aws
spec:
  region: us-east-1
  nodeGroups:
    - name: web-tier
      instanceTypes: [t3.large, t3a.large]
      scaling:
        min: 5
        max: 50
---
# Cluster on-premises (sensitive data)
apiVersion: v1
kind: Cluster
metadata:
  name: production-onprem
spec:
  controlPlaneEndpoint: 10.0.0.10:6443
  workloads:
    - databases            # Keep data on-premises for compliance
    - payment-processing
```


    Cross-Cluster Communication:
  • Submariner: Direct pod-to-pod networking across clusters
  • Cloud VPN: Site-to-site VPN (AWS-to-on-prem)
  • Service Mesh: Istio multi-cluster mode

    Figure: Multi-cluster architecture with a primary cluster in AWS, a secondary cluster in GCP, and an on-premises cluster joined by a service mesh for federated workload distribution.

    CI/CD Pipelines for Kubernetes


    GitOps Workflow


    Philosophy: Git as the single source of truth for infrastructure and applications

    Architecture:
  1. Developers commit code to Git
  2. CI pipeline builds Docker image, pushes to registry
  3. Automated tool updates Kubernetes manifests in Git
  4. GitOps operator (ArgoCD/Flux) detects changes and applies to cluster
  5. Kubernetes converges to desired state

    Example: ArgoCD Application

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/k8s-manifests
    targetRevision: main
    path: apps/web-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: web-app
  syncPolicy:
    automated:
      prune: true       # Delete resources not in Git
      selfHeal: true    # Revert manual changes
    syncOptions:
      - CreateNamespace=true
```


    Benefits of GitOps:
  • Audit Trail: Every change tracked in Git history
  • Rollback: `git revert` to undo deployment
  • Disaster Recovery: Re-create entire cluster from Git
  • Multi-Environment: Separate branches/directories for dev/staging/prod

    Complete CI/CD Pipeline


    GitHub Actions Example:

```yaml
name: Deploy to Kubernetes

on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Build Docker image
        run: |
          docker build -t myapp:${{ github.sha }} .
          docker tag myapp:${{ github.sha }} myapp:latest

      - name: Run security scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          severity: 'CRITICAL,HIGH'

      - name: Push to Docker registry
        run: |
          echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin
          docker push myapp:${{ github.sha }}
          docker push myapp:latest

      - name: Update Kubernetes manifests
        run: |
          cd k8s-manifests
          kustomize edit set image myapp=myapp:${{ github.sha }}
          git add .
          git commit -m "Update image to ${{ github.sha }}"
          git push

      # ArgoCD automatically detects the Git change and deploys
```


    Jenkins Pipeline (Alternative):

```groovy
pipeline {
    agent { label 'docker' }

    stages {
        stage('Build') {
            steps {
                sh 'docker build -t myapp:${BUILD_NUMBER} .'
            }
        }

        stage('Test') {
            steps {
                sh 'docker run myapp:${BUILD_NUMBER} npm test'
            }
        }

        stage('Security Scan') {
            steps {
                sh 'trivy image myapp:${BUILD_NUMBER}'
            }
        }

        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                withKubeConfig([credentialsId: 'k8s-prod']) {
                    sh '''
                        kubectl set image deployment/myapp \
                            myapp=myapp:${BUILD_NUMBER} \
                            --namespace=production

                        kubectl rollout status deployment/myapp \
                            --namespace=production \
                            --timeout=5m
                    '''
                }
            }
        }
    }

    post {
        failure {
            slackSend(channel: '#deployments', color: 'danger',
                message: "Deployment failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}")
        }
    }
}
```


    Figure: GitOps CI/CD pipeline, from Git commit through Docker build, image scan, and registry push to ArgoCD sync and Kubernetes deployment with automated rollback.

    Auto-Scaling Strategies


    1. Horizontal Pod Autoscaler (HPA)


    Scale based on metrics:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5min before scaling down
      policies:
        - type: Percent
          value: 50                     # Scale down max 50% of pods at once
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
        - type: Percent
          value: 100                    # Double pods if needed
          periodSeconds: 15
```


    Custom Metrics with Prometheus:

```yaml
# Install Prometheus Adapter first
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_length        # AWS SQS queue depth
          selector:
            matchLabels:
              queue_name: payments
        target:
          type: AverageValue
          averageValue: "30"            # Scale if queue > 30 msgs/pod
```


    2. Vertical Pod Autoscaler (VPA)


    Automatically adjust CPU/memory requests:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: db-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Auto"        # Recreate pods with new resources
  resourcePolicy:
    containerPolicies:
      - containerName: postgres
        minAllowed:
          cpu: 1
          memory: 2Gi
        maxAllowed:
          cpu: 8
          memory: 32Gi
        controlledResources: ["cpu", "memory"]
```


    Use Cases:
  • Databases with variable workloads
  • Batch processing jobs
  • Prevent over/under-provisioning

    3. Cluster Autoscaler


    Add/remove nodes based on pending pods:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  template:
    spec:
      containers:
        - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.28.0
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --namespace=kube-system
            - --nodes=3:20:worker-pool-1      # min:max:nodegroup
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --skip-nodes-with-local-storage=false
          env:
            - name: AWS_REGION
              value: us-east-1
```


    Best Practices:
  • Set realistic min/max node counts
  • Use pod disruption budgets so scale-down cannot evict too many pods at once
  • Monitor cluster autoscaler logs for scaling events
  • Combine with HPA for complete auto-scaling

    4. KEDA (Kubernetes Event-Driven Autoscaling)


    Scale based on external events:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: event-processor
  minReplicaCount: 1
  maxReplicaCount: 100
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: my-group
        topic: events
        lagThreshold: "50"          # Scale if lag > 50 messages
    - type: cron                    # Scale up before traffic spike
      metadata:
        timezone: America/New_York
        start: 0 8 * * *            # 8 AM daily
        end: 0 18 * * *             # 6 PM daily
        desiredReplicas: "20"
```


    Figure: The auto-scaling layers, HPA for pod scaling, VPA for resource right-sizing, Cluster Autoscaler for node management, and KEDA for event-driven scaling.

    Monitoring & Observability


    The Three Pillars


    1. Metrics (Prometheus + Grafana)

    Install Prometheus Stack:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.adminPassword=SecurePassword123
```


    Key Metrics to Monitor:
  • Cluster Health: Node CPU/memory, disk usage
  • Pod Health: Restart count, pod status (Pending, CrashLoopBackOff)
  • Application: Request rate, error rate, duration (RED method)
  • Resource Saturation: CPU throttling, OOM kills
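
    The RED items above map directly onto PromQL. A sketch of recording rules using the Prometheus Operator's PrometheusRule CRD; the http_* metric names assume typical HTTP server instrumentation and are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: red-recording-rules
spec:
  groups:
    - name: red-method
      rules:
        - record: service:http_requests:rate5m            # Rate
          expr: sum(rate(http_requests_total[5m])) by (service)
        - record: service:http_errors:ratio_rate5m        # Errors
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
        - record: service:http_request_duration:p95_5m    # Duration
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
```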

    Example: Custom Prometheus Alert

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
spec:
  groups:
    - name: application
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 5m
          annotations:
            summary: "High error rate detected: {{ $labels.service }}"
          labels:
            severity: critical

        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 15m
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} crash looping"
```


    2. Logs (ELK / Loki)

    Loki (Lightweight, integrates with Grafana):

```bash
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set promtail.enabled=true
```


    Log Aggregation Pattern:
  • Promtail: Collects logs from all pods
  • Loki: Stores and indexes logs
  • Grafana: Query and visualize logs

    Example: Query Pod Logs in Grafana

```logql
{namespace="production", app="web-app"} |= "error" | json | level="error"
```


    3. Traces (Jaeger / Tempo)

    Distributed Tracing for Microservices:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  template:
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_ZIPKIN_HTTP_PORT
              value: "9411"
          ports:
            - containerPort: 16686   # Jaeger UI
            - containerPort: 14268   # Collector
```


    Instrument App with OpenTelemetry:

```python
from flask import Flask

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = Flask(__name__)

# Configure tracer
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Trace function (db and send_confirmation_email are application helpers, not shown)
@app.route("/api/order")
def create_order():
    with tracer.start_as_current_span("create_order"):
        # Business logic
        with tracer.start_as_current_span("db_query"):
            order = db.insert_order()

        with tracer.start_as_current_span("send_email"):
            send_confirmation_email(order)

        return order
```


    Figure: Observability stack, Prometheus for metrics, Loki for logs, and Jaeger for traces, unified in Grafana dashboards.

    Security Hardening


    1. Network Policies


    Deny all traffic by default:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```


    Allow specific traffic:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app-policy
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: nginx-ingress
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:                      # Allow DNS
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
```


    2. Pod Security Standards


    Enforce restricted pod security:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```


    Secure Pod Spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:1.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        limits:
          cpu: "1"
          memory: "512Mi"
        requests:
          cpu: "100m"
          memory: "128Mi"
```


    3. Secrets Management


    External Secrets Operator (AWS Secrets Manager):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets
  target:
    name: db-secret
  data:
    - secretKey: password
      remoteRef:
        key: prod/db/postgres
        property: password
```


    Sealed Secrets (GitOps-Friendly):

```bash
# Encrypt secret before committing to Git
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# Commit sealed-secret.yaml (safe to store in Git)
git add sealed-secret.yaml
git commit -m "Add encrypted database credentials"
```


    4. RBAC (Role-Based Access Control)


    Principle of Least Privilege:

```yaml
# Role for developers (read-only)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: developer
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/log", "deployments", "jobs"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/portforward"]
    verbs: ["create"]   # Allow debugging
---
# Bind role to users
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: development
subjects:
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```


    Figure: Security layers, network policies, pod security standards, RBAC, secrets management, and Falco runtime threat detection.

    Cost Optimization


    1. Right-Sizing Resources


    Problem: Over-provisioning wastes 30-50% of cloud costs

    Solution: Goldilocks (VPA Recommendations)

```bash
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace

# Enable for namespace
kubectl label namespace production goldilocks.fairwinds.com/enabled=true

# View recommendations
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
# Open http://localhost:8080
```


    2. Spot Instances for Stateless Workloads


    AWS Spot Instance Node Group:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
nodeGroups:
  - name: stateless-spot
    instancesDistribution:
      instanceTypes: [t3.large, t3a.large, t2.large]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0   # 100% spot
      spotAllocationStrategy: capacity-optimized
    minSize: 5
    maxSize: 100
    labels:
      workload-type: stateless
    taints:
      - key: "spot"
        value: "true"
        effect: "NoSchedule"
```


    Pod Tolerations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    spec:
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload-type
                    operator: In
                    values: ["stateless"]
```


    3. Cluster Resource Usage Monitoring


    Kubecost (Cost Allocation):

```bash
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set prometheus.server.global.external_labels.cluster_id=production

# View dashboard
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
```


    Insights Provided:
  • Cost per namespace/deployment/pod
  • Idle resource recommendations
  • Rightsizing suggestions
  • Spot vs on-demand cost breakdown

    4. Storage Optimization


    Use appropriate storage classes:

```yaml
# Fast SSD for databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3          # Latest generation SSD
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
---
# Cheap HDD for logs/backups
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow-hdd
provisioner: ebs.csi.aws.com
parameters:
  type: sc1          # Cold HDD (cheapest)
volumeBindingMode: WaitForFirstConsumer
```


    Lifecycle Policies:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: logs-archive
  annotations:
    # Move to Glacier after 30 days
    aws-ebs-csi-driver/lifecycle-policy: glacier-30d
spec:
  storageClassName: slow-hdd
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Ti
```


    Figure: Cost optimization dashboard with resource utilization, spot instance savings, right-sizing recommendations, and storage tier distribution alongside monthly cost trends.

    Disaster Recovery & Business Continuity


    Backup Strategies


    Velero (Cluster Backup):

```bash
# Install Velero
velero install \
  --provider aws \
  --bucket k8s-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Schedule daily backups
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 720h   # Retain for 30 days

# Backup specific namespace
velero backup create prod-backup --include-namespaces production

# Restore from backup
velero restore create --from-backup prod-backup
```


    What Velero Backs Up:
  • All Kubernetes resources (Deployments, Services, ConfigMaps)
  • Persistent Volume data (via snapshot APIs)
  • Namespace configuration
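
    The daily schedule created above with the CLI can also live in Git as a Velero Schedule resource, which fits the GitOps workflow; this sketch mirrors the same cron expression, namespace, and retention:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"                # 2 AM daily
  template:
    includedNamespaces: ["production"]
    ttl: 720h0m0s                      # retain for 30 days
```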

    Multi-Region Failover


    Active-Passive Setup:

```yaml
# Primary cluster (us-east-1)
apiVersion: v1
kind: Service
metadata:
  name: app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com
spec:
  type: LoadBalancer
  selector:
    app: web-app

# Secondary cluster (us-west-2) - standby
# Use Route53 health checks to failover
```


    Active-Active (Multi-Region):
  • Database Replication: PostgreSQL streaming replication, MySQL Group Replication
  • Object Storage: S3 cross-region replication
  • Traffic Management: AWS Route53 geolocation routing, Cloudflare Load Balancing
  • Data Consistency: Eventual consistency, conflict resolution strategies

    Production Checklist


    Pre-Deployment

  • [ ] Resource limits defined for all pods
  • [ ] Liveness and readiness probes configured
  • [ ] Pod disruption budgets set
  • [ ] Network policies enforced
  • [ ] Secrets stored in external secret manager
  • [ ] RBAC roles follow least privilege
  • [ ] Container images scanned for vulnerabilities (Trivy, Snyk)
  • [ ] Horizontal Pod Autoscaler configured
  • [ ] Monitoring and alerting set up (Prometheus, Grafana)
  • [ ] Logging aggregation configured (Loki, ELK)
  • [ ] Backup strategy tested (Velero)
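
    Two of the items above, resource limits and liveness/readiness probes, are the ones most often missed. A minimal container spec sketch covering both; the endpoints, port, and thresholds are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: myapp:1.0
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: "1", memory: 512Mi }
          readinessProbe:            # removes the pod from Service endpoints while failing
            httpGet: { path: /healthz/ready, port: 8080 }
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:             # restarts the container when failing
            httpGet: { path: /healthz/live, port: 8080 }
            initialDelaySeconds: 15
            periodSeconds: 20
```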

    Post-Deployment

  • [ ] Verify all pods running (`kubectl get pods`)
  • [ ] Check autoscaling behavior under load
  • [ ] Test disaster recovery (restore from backup)
  • [ ] Validate security policies (network isolation, RBAC)
  • [ ] Run chaos engineering tests (Chaos Mesh, Litmus)
  • [ ] Document runbooks for common incidents
  • [ ] Conduct post-deployment review

    Conclusion: Kubernetes Mastery Path


    Kubernetes production excellence requires continuous learning:


    Months 1-3: Foundations

  • Complete CKAD/CKA certifications
  • Deploy personal projects to managed Kubernetes (GKE, EKS)
  • Master kubectl, YAML manifests, Helm charts

    Months 4-6: Production Patterns

  • Implement GitOps with ArgoCD
  • Set up comprehensive monitoring (Prometheus, Grafana, Loki)
  • Practice incident response and debugging

    Months 7-12: Advanced Topics

  • Multi-cluster management
  • Service mesh (Istio, Linkerd)
  • Advanced security (OPA/Gatekeeper policies, Falco runtime security)
  • Cost optimization strategies

    Year 2+: Expertise

  • Contribute to CNCF projects
  • Run chaos engineering experiments
  • Architect multi-cloud Kubernetes platforms
  • Pursue CKS (Certified Kubernetes Security Specialist)

    Remember: Kubernetes is a journey, not a destination. The ecosystem evolves rapidly, so stay curious, keep learning, and always test in staging before production.


    FAQ — People Also Ask