Every Kubernetes tutorial teaches you how to deploy an Nginx pod and call it a day.
Nobody teaches you what happens when your monitoring is misconfigured and your node pool quietly runs out of memory. Nobody teaches you what OOMKilled means at 2am when your platform is down. Nobody teaches you the things you only learn by actually running Kubernetes in production.
That memory and monitoring failure is real. I have been there. It is the kind of incident that teaches you more than any tutorial ever will, and it is where this post starts.
The memory incident and what it actually taught me
A node pool running out of memory does not look like an emergency until it is one. The first sign is degraded performance: requests slowing down, response times climbing. Then pods start getting evicted. Then OOMKilled starts appearing in your pod statuses and events.
OOMKilled means the Linux kernel killed a container because it exceeded its memory limit, or because the node itself ran out of memory and the kernel had to choose what to kill. When your monitoring is not configured to alert on this, you find out when users start reporting problems — not when the problem starts.
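You can see this after the fact in the pod's status. Here is an abridged view of the relevant fields from kubectl describe pod (the pod name and restart count are illustrative); exit code 137 is 128 + 9, meaning the process received SIGKILL:

```
$ kubectl describe pod <pod-name>
...
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Restart Count:  4
```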
The lesson is not just "set memory limits." It is layered.
Resource limits are not optional
Every workload needs requests and limits set for both CPU and memory. requests is what the scheduler uses to decide which node a pod fits on; limits is the hard ceiling the container cannot exceed at runtime.
If you set no memory limit, a single container can consume the entire node's memory, and other pods on that node get evicted or OOM-killed through no fault of their own. Set limits based on what your application actually uses under real load, not what you guess it might need.
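As a concrete sketch, here is roughly what that looks like on a Deployment. The name, image, and numbers are placeholders; replace them with values measured under real load:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                        # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.0   # placeholder image
          resources:
            requests:
              cpu: 250m            # what the scheduler reserves on the node
              memory: 256Mi
            limits:
              cpu: "1"             # throttled above this
              memory: 512Mi        # OOMKilled above this
```

One asymmetry worth knowing: a container that exceeds its CPU limit is throttled, but a container that exceeds its memory limit is killed.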
Readiness and liveness probes do different jobs
A pod in the Running state does not mean your application is serving traffic. It means the container process started. Those are different things.
Readiness probes tell Kubernetes when a pod is actually ready to receive traffic. Without them, a pod can be Running, registered behind a Service, and returning errors because the application inside it has not finished initialising.
Liveness probes tell Kubernetes when a pod needs to be restarted. A deadlocked application that is technically still running will sit there indefinitely without a liveness probe, appearing healthy while serving nothing useful.
Both probes should check something real — an HTTP endpoint that actually exercises the application, not just a process check.
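A minimal sketch of both probes on a Pod, assuming the application exposes /ready and /healthz HTTP endpoints. The paths, port, image, and timings are all illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: example/api:1.0      # placeholder image
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /ready            # hypothetical endpoint: "can I take traffic yet?"
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz          # hypothetical endpoint: "am I alive at all?"
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
        failureThreshold: 3       # restart only after 3 consecutive failures
```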
Namespaces are organisation, not security
Pods in different namespaces on the same cluster can communicate with each other by default. A compromised workload in one namespace can reach services in another unless explicit NetworkPolicy rules prevent it.
If you need real isolation between workloads — separate environments, different clients, different risk profiles — that requires NetworkPolicy, or separate clusters entirely. Namespaces alone do not provide it.
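A minimal sketch of a default-deny ingress policy: applied to a namespace, every pod in it stops accepting traffic unless a more specific policy allows it. The namespace name is hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a        # hypothetical namespace
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules listed, so all inbound traffic is denied
```

Note that NetworkPolicy is only enforced if the cluster's network plugin supports it. Calico and Cilium do; some simpler CNIs silently ignore these objects.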
etcd is the cluster. Back it up.
Everything about your Kubernetes cluster — every Deployment, every Secret, every ConfigMap — is stored in etcd. If etcd is lost without a backup, the cluster is gone.
Back up etcd regularly and test the restore. A backup you have never restored is a backup you do not actually have.
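On a kubeadm-provisioned control plane, taking a snapshot looks roughly like this. The certificate paths are kubeadm's defaults and the backup path is a placeholder, so adjust both for your setup:

```sh
# Take a snapshot of etcd (run on a control-plane node)
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is readable: the cheap half of "test the restore"
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db
```

snapshot status only proves the file is a valid snapshot. A real restore test means running etcdctl snapshot restore against it and bringing a cluster up from the result.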
Monitoring is not optional — it is the job
The memory incident I described at the start was preventable. Properly configured monitoring would have alerted before the node filled up, before pods started getting evicted, before users saw anything at all.
You need alerts on node memory and CPU utilisation, pod restart counts, failed deployments, and pending pods that cannot be scheduled. You need to know at 70% memory, not at 100%.
Prometheus and Grafana are the standard tooling. Setting them up correctly takes time, and that time is worth spending before the incident, not after.
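As a sketch, the 70% memory alert could be expressed as a Prometheus rule like the one below, assuming node_exporter metrics are being scraped. The threshold, duration, and names are starting points, not gospel:

```yaml
groups:
  - name: node-memory
    rules:
      - alert: NodeMemoryHigh
        # Fires when less than 30% of a node's memory is available,
        # i.e. usage has crossed the 70% line.
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.3
        for: 10m                   # sustained for 10 minutes, not a momentary spike
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} has been above 70% memory usage for 10 minutes"
```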