The Modern Kubernetes Administrator
Forget kubeadm, Pod Security Policies (deprecated in v1.21 and removed in v1.25), and legacy NGINX Ingress. The ecosystem has evolved. This is the expanded, battle-tested curriculum covering the Gateway API, eBPF (Cilium), Karpenter, and provisioning infrastructure for Agentic AI workloads.
Module 0: The 15-Minute Admin
Before diving into the sprawling architecture of the Control Plane, you need to experience what a Kubernetes Administrator actually does on a Tuesday morning. Theory is useless without practical muscle memory.
Scenario: The Web App is Down
You just received an alert. The frontend is returning 502 Bad Gateway. Here is the exact debugging loop.
- Step 1: Discovery
  First, check the state of the workloads in the namespace.

  ```
  ❯ kubectl get pods -n production
  NAME                            READY   STATUS             RESTARTS   AGE
  webapp-deployment-7f89b-2a9xx   0/1     CrashLoopBackOff   12         45m
  webapp-deployment-7f89b-9b4zz   0/1     CrashLoopBackOff   12         45m
  ```

- Step 2: Investigation (Events & Logs)
  A pod crashing repeatedly means the kubelet keeps starting it, but the container process exits. Check the events first, then the logs of the previous (crashed) container instance.

  ```
  ❯ kubectl describe pod webapp-deployment-7f89b-2a9xx -n production
  ... (Scrolling down to Events) ...
  Warning  BackOff    1m  kubelet  Back-off restarting failed container
  Warning  Unhealthy  2m  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 500

  ❯ kubectl logs webapp-deployment-7f89b-2a9xx -n production --previous
  FATAL: Environment variable DB_PASSWORD is missing or empty. Shutting down...
  ```

- Step 3: The Fix & Rollout
  You realize a `Secret` was accidentally modified via GitOps. You revert the change, apply it, and force the deployment to restart, because environment variables sourced from a Secret are only read when a container starts.

  ```
  ❯ kubectl apply -f secrets.yaml
  secret/webapp-db-secret configured

  ❯ kubectl rollout restart deployment/webapp-deployment -n production
  deployment.apps/webapp-deployment restarted
  ```
You just exercised the commands an administrator reaches for most days. Kubernetes administration is a loop of: Declare State → Check Status → Read Events → Fix Config → Re-apply. Do not let the complexity of the control plane distract you from how simple this day-to-day workflow is.
Module 1: The Kubernetes Lexicon & Architecture Blueprint
Kubernetes introduces a massive amount of jargon. Before looking at the architecture graph, you must understand the vocabulary. This is the 2026 translation guide for day-to-day terminology.
The Jargon Decoder
- API Server (kube-apiserver): The single entry point to the cluster. Every `kubectl` command, UI click, and internal component communicates strictly through the API Server.
- etcd: A highly-available, distributed key-value store. It holds the entire cluster state: the "Desired State" you declare and the status that components report back. Only the API Server talks to it directly.
- Scheduler (kube-scheduler): Watches for newly created Pods that have no Node assigned. It evaluates CPU/Memory/Affinity rules and assigns the Pod to the best Worker Node.
- Controller-Manager (kube-controller-manager): A collection of control loops running in the background. Each one constantly compares the *Actual State* of the cluster against the *Desired State*, both read through the API Server. If you want 3 pods and only 2 are running, the Controller-Manager detects the drift and asks the API Server to create a 3rd.
- Kubelet: An agent running on every Worker Node. It listens to the API Server for instructions ("Start this Pod") and reports back the node's health and pod status.
- Kube-proxy: Maintains network rules (historically via iptables) on each node so that traffic sent to a Service is forwarded to one of its backing Pods, whether it originates inside or outside the cluster.
- Container Runtime: The actual software responsible for pulling the image and running the container. `containerd` and `CRI-O` are the modern standards.
The Architecture Blueprint
Below is the visual representation of how the User Interface and CLI interact with the Control Plane, which in turn manages the Worker Nodes.
Legacy Image Correction: Older diagrams often label the runtime as "Docker". Kubernetes removed its built-in Docker integration (dockershim) in v1.24, so for technical accuracy in 2026 the diagram below identifies the runtime as the CRI Runtime (containerd).
Module 2: Paradigms & Business Use Cases
Why incur the complexity of Kubernetes? It is not a silver bullet. You must understand where Kubernetes defeats traditional Bare Metal / VMs, and where it outscales Serverless architectures.
| Paradigm | Plain Server (VM/Bare Metal) | Serverless (Lambda/Cloud Run) | Kubernetes (Modern) |
|---|---|---|---|
| Architecture | Monolithic. Heavy OS footprint per app. | Event-driven. Ephemeral containers managed by vendor. | Declarative microservices. High-density packing on shared nodes. |
| Scaling | Manual or slow (booting a VM takes minutes). | Near-instant scaling from 0 to thousands of instances. Suffers from "Cold Starts". | Dynamic: HPA adds pods in seconds, and Karpenter adds nodes when pods no longer fit. Can keep warm capacity. |
| Cost Efficiency | High baseline cost (paying for idle CPU). | Very cheap for low traffic. Very expensive at sustained high traffic. | Efficient at scale. Fractional sharing of CPUs/GPUs. |
| Best For | Legacy monolithic databases, raw performance, extreme compliance. | Startups, erratic traffic spikes, event-driven glue code. | SaaS platforms, AI/ML pipelines, multi-cloud portability, complex distributed systems. |
Real-World Business User Stories
Case 1: High-Traffic E-Commerce (Black Friday)
The User Story:
"As a CTO, I need the frontend and payment services to scale massively during the Black Friday traffic spike, but I don't want to pay for 1,000 servers for the other 364 days of the year. I also cannot afford downtime during updates."
Implementation Steps:
- Deploy microservices as independent K8s Deployments.
- Attach a HorizontalPodAutoscaler (HPA) to the frontend deployment, triggering at 70% CPU utilization.
- Install Karpenter to monitor for "Pending" pods. When the HPA requests 50 new pods and the cluster runs out of room, Karpenter launches right-sized Spot Instances directly through the AWS API and joins them to the cluster.
- Use the Gateway API to route checkout traffic, while ArgoCD syncs any mid-sale patch and the Deployment's rolling update strategy keeps the rollout zero-downtime.
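The HPA step above can be sketched as a manifest. This is a minimal example under assumptions: the Deployment name `frontend`, the `production` namespace, and the replica bounds are hypothetical, not taken from the scenario.

```yaml
# Hypothetical names and bounds; only the 70% CPU target comes from the text above.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3            # assumed baseline for normal days
  maxReplicas: 100          # assumed ceiling for the traffic spike
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the pods' CPU *requests*
```

Utilization is measured against each container's CPU request, so the target Deployment must declare CPU requests for this metric to work.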
Case 2: Heavy AI / ML Batch Processing
The User Story:
"As a Data Science lead, I need to run 10,000 asynchronous LLM fine-tuning jobs. GPUs are wildly expensive, so I need to queue these jobs, fractionally share the GPUs, and shut everything down immediately when finished."
Implementation Steps:
- Package the Python ML scripts into containers.
- Deploy them using Kubernetes Jobs instead of Deployments (since they need to run to completion and terminate, not run forever).
- Use Kubernetes Dynamic Resource Allocation (DRA) to request specific GPU capacity (e.g., a partition of an A100, where the vendor's DRA driver exposes one).
- Use KEDA (Kubernetes Event-driven Autoscaling) to monitor a Kafka queue length. KEDA launches K8s Jobs in proportion to the backlog, processing massive datasets in parallel.
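One way to express the queue-driven Jobs above is a KEDA `ScaledJob`. The sketch below is illustrative only: the image, topic, consumer group, and broker address are hypothetical, and to stay short it requests a whole GPU through the `nvidia.com/gpu` device-plugin resource rather than a DRA claim.

```yaml
# Hypothetical image, broker, topic and limits. Uses the device-plugin GPU
# resource instead of DRA to keep the example compact.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: finetune-worker
spec:
  jobTargetRef:
    backoffLimit: 2
    template:
      spec:
        restartPolicy: Never          # Jobs must not use Always
        containers:
          - name: worker
            image: registry.example.com/ml/finetune:1.0.0
            resources:
              limits:
                nvidia.com/gpu: 1
  pollingInterval: 30                 # seconds between queue checks
  maxReplicaCount: 50                 # cap on concurrently running Jobs
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.svc:9092
        consumerGroup: finetune-workers
        topic: finetune-requests
        lagThreshold: "1"             # roughly one Job per pending message
```

When the consumer-group lag drops to zero, no new Jobs are created, so GPU nodes can be reclaimed once the running Jobs finish.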
Module 3: Workloads & Native Sidecars
The Pod is the atomic unit of Kubernetes. It represents a single instance of a running process in your cluster. Pods can contain one or multiple containers that share the same Network namespace (localhost) and storage volumes.
Latest Change: Native Sidecar Containers (v1.29+)
For years, running a "Sidecar" (like a logging agent, Datadog forwarder, or an Istio proxy alongside your main app) inside a Job was painful. When the Job's main container finished, the sidecar kept running, so the Pod never completed. Modern Kubernetes solves this by allowing `restartPolicy: Always` on an entry in `initContainers`: that container starts before the app containers, keeps running alongside them, and is shut down automatically once they exit.
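A minimal sketch of a Job with a native sidecar follows. The image names and the shared log volume are hypothetical; the only part taken from the text is `restartPolicy: Always` on the init container.

```yaml
# Hypothetical images and paths, shown only to illustrate the sidecar field.
apiVersion: batch/v1
kind: Job
metadata:
  name: report-job
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: log-forwarder
          image: registry.example.com/log-forwarder:1.0.0
          restartPolicy: Always       # marks this init container as a sidecar
          volumeMounts:
            - name: logs
              mountPath: /var/log/app
      containers:
        - name: main
          image: registry.example.com/report:1.0.0
          volumeMounts:
            - name: logs
              mountPath: /var/log/app
      volumes:
        - name: logs
          emptyDir: {}
```

Once `main` exits, the kubelet stops `log-forwarder` and the Job is marked complete.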
Workload Controllers
You should almost never create a raw `Pod` via YAML. You create a Controller, which manages the lifecycle, scaling, and healing of Pods. If a node fails, the controller notices the missing pods and creates replacements, which the scheduler then places on healthy nodes.
- Deployment: Used for stateless applications (Web servers, APIs). It manages a `ReplicaSet` to ensure X number of pods are running and provides zero-downtime rolling updates.
- DaemonSet: Ensures that one copy of a Pod runs on *every* eligible Node in the cluster (every node that matches its node selector and tolerations). Used for infrastructure agents (Fluentd, Cilium, Datadog).
- StatefulSet: Used for stateful applications. Unlike Deployments where pods have random hashes (`web-7f89b`), StatefulSets provide predictable identities (`web-0`, `web-1`) and guarantee ordered scaling.
Scaling a StatefulSet down to 0 does not delete the underlying PersistentVolumeClaims (PVCs). This is a safety feature to prevent accidental data deletion. However, you must clean them up manually, otherwise you will incur lingering cloud storage costs for disks that are no longer attached to any compute.
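Recent Kubernetes versions also offer an opt-in alternative to manual cleanup: the `persistentVolumeClaimRetentionPolicy` field on the StatefulSet. The sketch below assumes your cluster version supports the field; the names, image, and sizes are hypothetical.

```yaml
# Hypothetical StatefulSet; check that your cluster version supports
# persistentVolumeClaimRetentionPolicy before relying on it.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete      # delete a pod's PVC when it is scaled away
    whenDeleted: Retain     # keep PVCs if the whole StatefulSet is deleted
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0.0
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Both settings default to `Retain`, which is the behaviour described above. Choosing `Delete` trades the safety net for lower storage bills, so it suits caches better than primary data.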
Module 4: Configuration & Probes
The 12-Factor App methodology dictates a strict separation of configuration from code. Kubernetes provides ConfigMaps and Secrets, which can be injected into Pods as Environment Variables or mounted as literal files on the container's disk.
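Both injection styles can be sketched in one Pod. The Secret name `webapp-db-secret` and the `DB_PASSWORD` variable come from Module 0; the key name, ConfigMap name, image, and mount path are assumptions for illustration.

```yaml
# The "password" key, "webapp-config" ConfigMap and mount path are assumed.
apiVersion: v1
kind: Pod
metadata:
  name: webapp
spec:
  containers:
    - name: webapp
      image: registry.example.com/webapp:1.0.0
      env:
        - name: DB_PASSWORD             # injected as an environment variable
          valueFrom:
            secretKeyRef:
              name: webapp-db-secret
              key: password
      volumeMounts:
        - name: app-config              # each ConfigMap key becomes a file
          mountPath: /etc/webapp
          readOnly: true
  volumes:
    - name: app-config
      configMap:
        name: webapp-config
```

Mounted files are refreshed when the ConfigMap changes, while environment variables are fixed at container start, which is why the Module 0 fix needed a rollout restart.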
Modern Probe Tuning (The Lifeline)
Kubernetes does not automatically know if your application is "healthy"—it only knows if the Linux process PID 1 is running. A Java application might be running but deadlocked. You must configure Probes. Modern Kubernetes supports HTTP, TCP, Exec, and gRPC native probes.
Readiness Probe
"Am I ready to receive traffic?" If this fails, K8s removes the pod's IP from the Service load balancer. It does not restart the pod.
Liveness Probe
"Am I deadlocked?" If this fails continuously based on `failureThreshold`, K8s sends a SIGTERM, terminates the container, and restarts it.
Startup Probe
Used for slow-starting legacy apps. It disables Liveness/Readiness checks completely until the app has successfully booted once.
Never make your Liveness and Readiness probes identical. Never have your Liveness probe check a downstream database connection.
Imagine a brief database network blip. If your Liveness probe fails because the DB is unreachable, Kubernetes will restart all of your application pods simultaneously. When they boot back up, the DB is still unreachable, so they crash again. You have just turned a minor downstream database blip into a catastrophic, cascading application outage.
Solution: Readiness can check downstream (so traffic stops routing to a broken pod), but Liveness should only check if the local process is responsive (e.g., a simple `/healthz` returning 200).
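The three probes might be combined as in the sketch below. The `/healthz` path is the one mentioned above; the `/ready` endpoint (assumed to check the database), the port, image, and timings are hypothetical.

```yaml
# Hypothetical endpoints and timings; tune them to your application's behaviour.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: webapp
          image: registry.example.com/webapp:1.0.0
          ports:
            - containerPort: 8080
          startupProbe:               # gates the other probes until first success
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 30      # allows up to 150s to boot
          livenessProbe:              # local process only, no downstream checks
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:             # may check downstream dependencies
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
```

During a database blip, `/ready` fails and the pods drop out of the Service, but `/healthz` keeps passing, so nothing is restarted.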
Module 5: Modern Networking & Gateway API
The traditional Ingress API object was designed in 2015. It was extremely limited, assuming simple host-and-path HTTP routing. To do anything advanced (canary deployments, header matching, gRPC), you had to write unreadable NGINX annotations (nginx.ingress.kubernetes.io/rewrite-target).
The Gateway API Era
The Gateway API is the modern, expressive standard. It separates concerns across personas. The Platform Team manages the GatewayClass and Gateway, while Application Developers manage the HTTPRoute to govern their own microservices.
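The persona split can be sketched with two objects: a Gateway owned by the platform team and an HTTPRoute owned by an application team. All names, namespaces, the hostname, and the GatewayClass are hypothetical; the weighted backends illustrate the kind of canary routing that used to need annotations.

```yaml
# Hypothetical names throughout; "example-class" stands in for whatever
# GatewayClass your controller installs.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: public-gateway
  namespace: infra                  # owned by the platform team
spec:
  gatewayClassName: example-class
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All                 # let app namespaces attach routes
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
  namespace: shop                   # owned by the application team
spec:
  parentRefs:
    - name: public-gateway
      namespace: infra
  hostnames:
    - shop.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /checkout
      backendRefs:
        - name: checkout-v1
          port: 8080
          weight: 90
        - name: checkout-v2         # canary receives about 10% of requests
          port: 8080
          weight: 10
```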
By default, Kubernetes configures /etc/resolv.conf in your pods with ndots:5. Any name with fewer than five dots is treated as relative, so when your application (e.g., written in Node.js on Alpine Linux) resolves an external API like api.stripe.com, the resolver first tries it against every search domain (api.stripe.com.default.svc.cluster.local, then .svc.cluster.local, then .cluster.local, plus any domains inherited from the node). Only after those lookups fail does it query the name as written. Each wasted lookup adds latency, which adds up quickly for services that call external APIs heavily.
Solution: Always use Fully Qualified Domain Names (FQDNs) ending with a dot in your application configurations (e.g., api.stripe.com.). The trailing dot tells the Linux resolver to skip the local search paths immediately.
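As a sketch, the trailing-dot fix lives in application configuration, and a per-Pod `dnsConfig` override is a complementary option that the text above does not mention. The variable name, image, and the `ndots` value of 2 are assumptions.

```yaml
# Hypothetical Pod. STRIPE_API_HOST is an invented variable name.
apiVersion: v1
kind: Pod
metadata:
  name: payments
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"          # names with 2+ dots are tried as written first
  containers:
    - name: app
      image: registry.example.com/payments:1.0.0
      env:
        - name: STRIPE_API_HOST
          value: "api.stripe.com."   # trailing dot skips the search list
```

Some HTTP and TLS client libraries handle trailing dots poorly, so test the FQDN form with your client before rolling it out. Lowering `ndots` avoids that risk but changes resolution for every name in the Pod.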
Module 6: Storage & Databases in Containers
For years, the cardinal rule of DevOps was "Never run databases in Docker." Containers are ephemeral; databases demand stability, IPC tuning, and exact disk I/O guarantees. However, with the maturity of the Container Storage Interface (CSI), this has shifted.
The Absolute Rules of K8s Database Management
1. Default to Managed Cloud Services. If you are on a major cloud provider, use RDS, Cloud SQL, or Neon. Do not manage stateful primary databases inside Kubernetes if you do not have a dedicated, 24/7 DBA team to handle split-brain recovery and point-in-time backups.
2. Never use raw StatefulSets for DBs. If you must run Postgres/MySQL or Vector DBs (Milvus, Qdrant) in K8s (for compliance, edge, or extreme cost savings), never write the YAML yourself. Relying on K8s to simply "restart" a pod is not high availability. True failover requires leader election and WAL promotion. Only use battle-tested Operators (e.g., CloudNativePG, Percona). Operators encapsulate the "Day 2 human DBA logic" into code.
3. Beware Network-Attached Storage (EBS/PD). Cloud volumes like AWS EBS sit across the network, and their latency spikes can badly hurt database transaction throughput. For high-performance DBs in K8s, use local NVMe instance storage orchestrated by local volume managers (like TopoLVM) combined with the DB Operator's synchronous replication.
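To illustrate rule 2, an operator turns a whole Postgres cluster into a few declarative lines. The sketch below uses the CloudNativePG `Cluster` resource as I understand it; the cluster name, size, and StorageClass are hypothetical, and replication settings are left at the operator's defaults.

```yaml
# Hypothetical cluster; "local-nvme" stands in for a StorageClass backed by
# local disks (for example one provided by TopoLVM).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: orders-db
spec:
  instances: 3              # one primary plus two replicas
  storage:
    size: 50Gi
    storageClass: local-nvme
```

The operator, not the StatefulSet controller, decides which instance is primary and promotes a replica on failure, which is the "Day 2" logic the rule refers to.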
Module 7: Resource Management & Autoscaling
The difference between a cheap, highly performant cluster and an incredibly expensive, crashing cluster lies purely in how you define Requests and Limits.
Developers often set strict CPU `Limits` thinking it protects the node. Kubernetes enforces them through the Linux Completely Fair Scheduler (CFS) quota, which is accounted in short periods (100ms by default). A container that uses up its quota early in a period is paused until the next one, so your app can suffer latency spikes even when the underlying Node is 90% idle.
Best Practice: Always set Memory Limits equal to Memory Requests (so memory is never overcommitted and pods are not killed by surprise when a node runs short), but generally leave CPU Limits undefined (or set them extremely high) unless you are running a strictly multi-tenant SaaS cluster. Rely on CPU Requests for scheduling.
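The practice above translates to a `resources` block like the following sketch. The Pod name, image, and the specific values are hypothetical.

```yaml
# Hypothetical values; size them from observed usage.
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: registry.example.com/api:1.0.0
      resources:
        requests:
          cpu: 500m         # used by the scheduler and for CPU share weighting
          memory: 512Mi
        limits:
          memory: 512Mi     # equal to the request: no memory overcommit
          # no cpu limit: the container may burst into idle CPU unthrottled
```

Because the CPU limit is omitted, this Pod lands in the Burstable QoS class rather than Guaranteed, which requires requests to equal limits for both CPU and memory.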
Node Autoscaling: The Karpenter Revolution
The legacy Cluster-Autoscaler is group-based (Auto Scaling Groups). If a Pod asked for a GPU, the autoscaler could only grow a pre-defined group of GPU nodes with a fixed instance type, which was slow and often oversized for the workload.
Karpenter is the modern standard. It is group-less and "Just-in-Time." It observes the pending Pods, calculates the right-sized compute instance needed to fit those specific pods (mixing Spot and On-Demand instances, Graviton and x86), and provisions it directly through the cloud provider API.
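A NodePool expressing the Spot/On-Demand and Graviton/x86 mix might look like the sketch below. Field names follow the Karpenter v1 API on AWS as I understand it, so verify them against your installed version; the referenced `EC2NodeClass` named `default` and the CPU ceiling are assumptions.

```yaml
# Sketch for Karpenter v1 on AWS; assumes an EC2NodeClass named "default" exists.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]    # x86 and Graviton
  limits:
    cpu: "1000"                         # assumed ceiling on total provisioned CPU
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

Allowing `arm64` only helps if your images are built for both architectures.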
Module 8: Zero-Trust Security & Governance
Kubernetes was built for velocity, not security. By default, it is highly permissive: Pods can talk to any other Pod. You must lock it down using Pod Security Admission and policy engines that run as admission controllers (like Kyverno or OPA Gatekeeper).
The Mandatory Security Context
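If an attacker breaks out of a container whose process runs as `root` (UID 0), they gain host-level root access to the Node. Every production workload should therefore run with a restrictive security context: a non-root user, no privilege escalation, a read-only root filesystem, and all Linux capabilities dropped.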
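A commonly used hardened baseline is sketched below. It is an illustration rather than a canonical snippet; the UID/GID 10001, Pod name, and image are arbitrary assumptions.

```yaml
# Illustrative baseline; UID/GID values are arbitrary non-root IDs.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
spec:
  securityContext:                      # pod-level settings
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      securityContext:                  # container-level settings
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```

With a read-only root filesystem, applications that write temporary files need an `emptyDir` volume mounted at the relevant path.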
Network Policies: Default Deny
By default, if the frontend is hacked, the attacker can ping the database directly. You must implement a "Default Deny" Network Policy per namespace, explicitly opening only required paths.
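A default-deny policy plus one explicit allowance can be sketched as follows. The namespace, the `app: api` and `app: postgres` labels, and the port are hypothetical.

```yaml
# Hypothetical labels and port. Requires a CNI plugin that enforces NetworkPolicy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}                 # selects every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 5432
```

Because the first policy also denies egress, the `api` pods additionally need egress rules for the database and for cluster DNS before this path works end to end.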
By default, Kubernetes mounts a Service Account token (a JWT) into `/var/run/secrets/kubernetes.io/serviceaccount` on every single Pod. The default ServiceAccount has few permissions under RBAC, but accounts are frequently granted broad roles over time. If your application suffers a Remote Code Execution (RCE) attack, the attacker can extract this token and query the Kubernetes API with whatever permissions it carries, which in the worst case is enough to take over the cluster.
Fix: Always set automountServiceAccountToken: false in your Pod spec unless the application specifically needs to talk to the K8s API (like an operator or ingress controller).
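In a Pod spec the fix is a single field, shown here in a minimal sketch with a hypothetical name and image.

```yaml
# Hypothetical Pod; only the automountServiceAccountToken field matters here.
apiVersion: v1
kind: Pod
metadata:
  name: webapp
spec:
  automountServiceAccountToken: false   # no API token is mounted into the pod
  containers:
    - name: webapp
      image: registry.example.com/webapp:1.0.0
```

The same field can be set on a ServiceAccount to change the default for every Pod that uses it.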
Module 9: GitOps (ArgoCD & Flux)
Do not run kubectl apply from Jenkins or GitHub Actions. Giving your CI server admin credentials to your production cluster is a massive security risk. The modern standard is the GitOps "Pull" Paradigm.
- 1. Continuous Integration (Push): GitHub Actions builds your application code, runs unit tests, builds the container image, pushes it to an Image Registry (ECR/GCR), and finally commits the new image tag (e.g., `v1.2.3`) to a separate Infrastructure Git Repository.
- 2. Continuous Deployment (Pull via ArgoCD): An agent running inside the Kubernetes cluster (ArgoCD or FluxCD) detects the new commit in the Git Repository. It calculates the drift between the cluster and Git, and automatically pulls the new state into the cluster. If an admin manually edits a deployment via `kubectl edit`, ArgoCD (with automated sync and self-heal enabled) reverts the change, enforcing Git as the Single Source of Truth.
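The pull side can be sketched as an Argo CD `Application`. The repository URL, path, and names are hypothetical; `selfHeal` is the setting that reverts manual edits.

```yaml
# Hypothetical repository and paths.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: webapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/infra.git
    targetRevision: main
    path: apps/webapp
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: production
  syncPolicy:
    automated:
      prune: true       # delete resources that were removed from Git
      selfHeal: true    # revert changes made directly in the cluster
```

Without the `automated` block, Argo CD only reports drift and waits for a manual sync.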
Module 10: eBPF & Observability
Troubleshooting microservices requires deep visibility. OpenTelemetry (OTel) is now the de facto vendor-neutral standard for tracing, and it is steadily displacing vendor-specific SDKs.
The eBPF Revolution (Cilium)
Historically, Kubernetes networking relied on kube-proxy manipulating thousands of iptables rules, which are evaluated sequentially and become inefficient at scale. Many modern clusters replace kube-proxy entirely with Cilium.
Cilium uses eBPF (Extended Berkeley Packet Filter) to run sandboxed programs directly inside the Linux kernel. This allows for massive performance gains, zero-instrumentation network observability (via Hubble), and the ability to drop malicious packets before they ever reach the container network namespace.
For extreme security, you should build your container images using "Distroless" bases or "Scratch". These images contain no OS tools, not even a shell (`/bin/sh`). If the application is compromised, the attacker has no shell or utilities to work with, which makes follow-up attacks much harder.
But how do you debug them? You use Ephemeral Containers. You can run `kubectl debug -it <pod-name> --image=busybox --target=<container-name>`. This attaches a temporary container with a shell to the running pod, sharing its network namespace and the target container's process namespace, without modifying the production image. busybox itself ships neither `curl` nor `strace`, so pick a debug image that includes the tools you need.
Module 11: Agentic AI Workloads
The Frontier: The dominant workload shift of 2026 is hosting Large Language Models (LLMs) and autonomous AI Agents. Kubernetes supports this with Dynamic Resource Allocation (DRA) for requesting and sharing GPUs, alongside serving and distributed-compute frameworks like vLLM and Ray.
1. Strict Sandboxing
If your AI Agent writes and executes code (e.g., LangChain/AutoGPT data analysis), it must run in a sandboxed runtime. Use gVisor or Kata Containers. If the LLM is prompt-injected and goes rogue, the sandbox prevents it from executing host-level kernel attacks.
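Selecting a sandboxed runtime is done through a `RuntimeClass`. The sketch below assumes gVisor is already installed on the nodes and registered with the container runtime under the handler name `runsc`; the Pod name and image are hypothetical.

```yaml
# Assumes nodes already have gVisor configured with the "runsc" handler.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: agent-sandbox
spec:
  runtimeClassName: gvisor              # run this pod inside the sandbox
  automountServiceAccountToken: false   # generated code gets no API token
  containers:
    - name: code-runner
      image: registry.example.com/agent-runner:1.0.0
```

Pods that reference a RuntimeClass whose handler is missing on a node fail to start there, so pair this with node selectors or the RuntimeClass `scheduling` field.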
2. Asynchronous Scaling (KEDA)
Agents spend most of their time waiting for external OpenAI/Anthropic API responses, so their CPU and memory usage says little about how much work is queued. Do not scale them on CPU/Memory using HPA. Scale them based on task queue depth (using KEDA + Redis/Kafka) to drastically reduce cloud costs.
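Queue-depth scaling can be sketched with a KEDA `ScaledObject` and the Redis list scaler. The Deployment name, Redis address, list name, and thresholds are hypothetical.

```yaml
# Hypothetical worker Deployment and Redis list.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-workers
spec:
  scaleTargetRef:
    name: agent-worker            # Deployment running the agent workers
  minReplicaCount: 0              # scale to zero when the queue is empty
  maxReplicaCount: 40
  triggers:
    - type: redis
      metadata:
        address: redis.example.svc:6379
        listName: agent-tasks
        listLength: "5"           # target of about 5 queued tasks per replica
```

Scaling to zero means the first task after an idle period waits for a pod to start, so set `minReplicaCount` to 1 if that latency matters.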
Standard Kubernetes Services spread connections across pods (randomly or round-robin, depending on the proxy mode) with no awareness of conversation state. If you have an AI agent that holds conversation context in memory (a stateful conversational agent), request 2 might hit a different pod than request 1, losing all context. Use the Gateway API for header-based routing (Sticky Sessions), or run the agents as a StatefulSet behind a headless Service so that multi-step tasks can address a specific pod by its stable name.