Running Kubernetes in Production: Lessons From Real Clusters

Observability, scaling, upgrades, and operational discipline — what we learned running K8s for fintech, healthcare, and e-commerce.

Kubernetes in production is a different beast than Kubernetes in a lab. We've run clusters for fintech, healthcare, and e-commerce — and the lessons are consistent: observability, upgrade discipline, and resource hygiene make or break you.

This isn't a tutorial on deploying your first pod. It's what we wish we'd known before taking real traffic on K8s.

Observability before you need it

When something breaks at 2 a.m., you need logs, metrics, and traces that actually tell you what's wrong. That means instrumenting before go-live: structured logging (JSON, with correlation IDs), distributed tracing across services, and metrics that map to both infrastructure and business events.
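As an illustration of structured logging with correlation IDs, here is a minimal sketch using only the Python standard library. The field names, the `handle_request` helper, and the log message are assumptions made for the example, not a format the article prescribes.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name):
    """Return a logger that writes JSON lines to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    """Hypothetical request handler: reuse the caller's ID or mint a new one."""
    token = correlation_id.set(incoming_id or str(uuid.uuid4()))
    try:
        logger.info("payment authorized")
    finally:
        correlation_id.reset(token)


if __name__ == "__main__":
    handle_request(build_logger("checkout"), incoming_id="req-123")
```

Passing the incoming ID through (for example from a request header) is what lets one request be followed across services in the log aggregator.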

We standardize on the three pillars: Prometheus (or a managed equivalent) for metrics, a centralized log aggregator (Loki, Elasticsearch, or a vendor), and OpenTelemetry for traces. The moment you add a new service, it gets the same treatment. No exceptions.

The same baseline applies to workload configuration:

Resource requests and limits on every pod, to avoid best-effort scheduling
Liveness and readiness probes that reflect real health, not just process existence
PodDisruptionBudgets for anything that can't tolerate voluntary evictions
Network policies to restrict pod-to-pod traffic by default
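The four items above can be sketched as manifests. The example below builds them as Python dictionaries and prints JSON, which `kubectl apply -f` accepts alongside YAML. The workload name, image, ports, probe paths, and resource figures are all placeholders chosen for illustration.

```python
import json

APP = "checkout"  # hypothetical workload name
LABELS = {"app": APP}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": APP, "labels": LABELS},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": LABELS},
        "template": {
            "metadata": {"labels": LABELS},
            "spec": {"containers": [{
                "name": APP,
                "image": "registry.example.com/checkout:1.0.0",  # placeholder
                "ports": [{"containerPort": 8080}],
                # Setting requests keeps the pod out of the BestEffort QoS class.
                "resources": {
                    "requests": {"cpu": "250m", "memory": "256Mi"},
                    "limits": {"cpu": "500m", "memory": "256Mi"},
                },
                # Readiness gates traffic; the endpoint should check dependencies.
                "readinessProbe": {
                    "httpGet": {"path": "/readyz", "port": 8080},
                    "periodSeconds": 5,
                    "failureThreshold": 3,
                },
                # Liveness restarts the container when the endpoint stops answering.
                "livenessProbe": {
                    "httpGet": {"path": "/healthz", "port": 8080},
                    "initialDelaySeconds": 10,
                    "periodSeconds": 10,
                },
            }]},
        },
    },
}

# Keep at least two pods running during voluntary evictions such as node drains.
pdb = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": APP},
    "spec": {"minAvailable": 2, "selector": {"matchLabels": LABELS}},
}

# An empty podSelector matches every pod in the namespace; with no rules listed,
# all ingress and egress is denied until more specific policies allow it.
default_deny = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "default-deny-all"},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
}


def render(*objects):
    """Wrap the objects in a v1 List so they can be applied in one call."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": list(objects)}, indent=2)


if __name__ == "__main__":
    print(render(deployment, pdb, default_deny))
```

A default-deny egress policy also blocks DNS lookups, so in practice it is paired with explicit allow rules, and it only takes effect when the cluster's network plugin enforces NetworkPolicy.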

Scaling that doesn't surprise you

Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler are table stakes, but they only work if your application scales horizontally and your metrics are correct. We've seen HPAs scale on CPU while the real bottleneck was I/O or an external API. Custom metrics — queue depth, error rate, latency percentiles — often drive better scaling decisions.
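To see why the choice of metric matters, it helps to look at the arithmetic the HPA applies. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change while the ratio stays within a tolerance (0.1 by default). The sketch below reproduces only that rule; the real controller also handles missing metrics, unready pods, and stabilization windows. The queue-depth figures are made up.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified version of the HPA scaling rule for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no scaling
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical custom metric: average queue depth per pod, target 30.
print(desired_replicas(4, 90, 30, min_replicas=2, max_replicas=10))  # 10 (12 capped at max)
print(desired_replicas(4, 31, 30, min_replicas=2, max_replicas=10))  # 4 (within tolerance)
print(desired_replicas(4, 10, 30, min_replicas=2, max_replicas=10))  # 2
```

If CPU sits near its target while the queue grows, the ratio stays close to 1 and nothing scales, which is the failure mode described above.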

Load test before production. Know your pod's capacity, set appropriate min/max replicas, and run chaos experiments (e.g., kill a node, drain a zone) in staging so you're not debugging scaling behavior during an incident.
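One way to turn load-test results into replica bounds is sketched below. The heuristic (reserve a fraction of measured capacity as headroom, and keep at least one pod per zone) and every number in it are assumptions for illustration, not a rule from the article.

```python
import math


def replica_bounds(pod_capacity_rps, baseline_rps, peak_rps, headroom=0.25, zones=3):
    """Derive (min_replicas, max_replicas) from measured per-pod capacity."""
    # Plan against less than the measured capacity so pods are not run at the edge.
    usable_rps = pod_capacity_rps * (1 - headroom)
    # Cover baseline traffic, but never drop below one pod per zone.
    min_replicas = max(zones, math.ceil(baseline_rps / usable_rps))
    # Cover the expected peak; max can never be lower than min.
    max_replicas = max(min_replicas, math.ceil(peak_rps / usable_rps))
    return min_replicas, max_replicas


# A pod that sustained 200 req/s in load tests, 300 req/s baseline, 1800 req/s peak.
print(replica_bounds(200, baseline_rps=300, peak_rps=1800))  # (3, 12)
```

The resulting maximum is also worth checking against what the Cluster Autoscaler can actually provision, since pods that cannot be scheduled add no capacity.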

Upgrades without drama

Cluster upgrades are inevitable, and deferring them raises the risk: kubeadm, for one, does not support skipping minor versions, so a cluster that falls behind faces a chain of catch-up upgrades. We upgrade one minor version at a time, always in non-production first, with a rollback plan and a maintenance window for control plane changes.
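The one-minor-version-at-a-time rule is easy to encode in a planning script. The sketch below is a hypothetical helper, not part of any Kubernetes tooling: it lists the intermediate versions and expands each into a non-production-first rollout. The environment names are assumptions.

```python
def upgrade_path(current, target):
    """List the minor versions to step through, one at a time."""
    cur_major, cur_minor = (int(p) for p in current.lstrip("v").split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.lstrip("v").split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("major version changes are out of scope for this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not planned here")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


def rollout_plan(current, target, environments=("staging", "production")):
    """Expand each version step into an ordered list of (environment, version)."""
    return [(env, version)
            for version in upgrade_path(current, target)
            for env in environments]


print(upgrade_path("1.27", "1.30"))  # ['1.28', '1.29', '1.30']
for env, version in rollout_plan("1.27", "1.29"):
    print(f"upgrade {env} to {version}")
```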

Node pools get rotated: new nodes join with the new K8s version, workloads drain off old nodes, then we decommission the old pool. This avoids in-place node upgrades and keeps disruption predictable.
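The drain step of that rotation can be sketched as a script that prints the `kubectl` commands for review instead of running them. Creating the new pool and deleting the old one depend on the cloud provider and are left out. The node names and the timeout are placeholders.

```python
import shlex


def rotation_commands(old_nodes, drain_timeout="10m"):
    """Build the kubectl commands that move workloads off the old node pool."""
    commands = []
    # Cordon every old node first so evicted pods cannot land on another old node.
    for node in old_nodes:
        commands.append(["kubectl", "cordon", node])
    # Drain one node at a time; drain uses evictions, so PodDisruptionBudgets apply.
    for node in old_nodes:
        commands.append([
            "kubectl", "drain", node,
            "--ignore-daemonsets",      # DaemonSet pods are managed per node
            "--delete-emptydir-data",   # accept loss of emptyDir scratch data
            f"--timeout={drain_timeout}",
        ])
    return commands


for command in rotation_commands(["old-pool-node-1", "old-pool-node-2"]):
    print(shlex.join(command))
```

Printing rather than executing keeps the example safe to run anywhere; wiring it to `subprocess.run` would be a separate, deliberate step.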

The operational checklist

Production K8s demands discipline: RBAC locked down, secrets in a vault or external secrets operator, image scanning in CI, and a runbook for every critical path. The clusters that run smoothly are the ones where the team treats the platform as a product — documented, tested, and iterated on.

Kubernetes gives you power; it doesn't give you safety. That part you build.
