Lessons from Debugging 503s in Kubernetes with ArgoCD and HPA
The dashboard was a sea of red. Our error rates were climbing faster than I could debug. Every refresh showed more 503 errors, more frustrated users, and mounting pressure to find the root cause.
What looked deceptively simple - just another restart issue - turned out to be a web of misconfigurations involving ArgoCD, health checks, and the Horizontal Pod Autoscaler, all fighting each other in ways I never expected. By the end of the investigation, the error rate had dropped from hundreds of failed requests to just 5 errors out of 50,000. More importantly, this case study taught me what “acceptable” really means in distributed systems.
The first clue
The first clue came when I realized the Kubernetes pods didn’t have readiness or liveness probes configured. That explained why, when an instance was shutting down, incoming requests would sometimes land on a pod that was already halfway out the door. Kubernetes didn’t know the pod was unhealthy because there was no probe to signal it.
When I checked the manifests, I discovered the health check configuration was missing entirely. It wasn’t that the configuration was forgotten. It had never been passed through from ArgoCD, which was managing the deployments. A misconfiguration in the Argo setup meant that probe definitions weren’t being applied.
Here’s what was missing:
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20

Once the manifests were fixed so that Argo propagated the probes correctly, the 503 error rate dropped significantly. But the problem wasn't completely gone.
The real problem emerges
Under sustained load, the errors returned. This time, the culprit wasn’t health checks. It was scaling behavior. Kubernetes’ Horizontal Pod Autoscaler was doing exactly what it should: scaling the deployment up and down based on CPU/memory thresholds. But something strange kept happening. Whenever HPA scaled the deployment, ArgoCD would overwrite the replicas setting back to what was in Git.
It was like watching two controllers fight each other. HPA would say “Traffic is heavy, let’s add pods,” and Argo would respond “Nope. Git says 3 replicas. Back you go.” This tug-of-war meant pods were being removed at inopportune times, and the system kept oscillating between scaling and rollback. The result: instability, and more 503s.
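To picture the other side of that fight, here is a minimal sketch of the kind of HPA that was in play. The names and thresholds are illustrative assumptions, not our actual manifest:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api                # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api              # the same Deployment ArgoCD was managing
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Once a controller like this exists, the Deployment's spec.replicas becomes a runtime-managed value, which is exactly the field Git kept clawing back.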
The fix was to tell ArgoCD to ignore the replicas field so that it wouldn’t overwrite what HPA was dynamically setting. Argo has a configuration option for this:
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas

After applying this configuration, ArgoCD respected HPA's decisions, and the autoscaler was free to manage replicas without interference.
The lesson was to use ignoreDifferences to work with the system's natural behavior instead of fighting against it.
The breakthrough realization
At this point, things looked good. The error rate dropped dramatically: from dozens per minute to just 5 failed requests out of 50,000 total. And that’s where the most important realization happened.
The initial instinct was: “Why do we still have any 503s? Can’t we eliminate them completely?” The truth is, in distributed systems, perfection is unrealistic. When a pod scales down, there will always be a tiny window where inflight requests can get dropped. Readiness probes and graceful shutdown settings can reduce this, but they can’t drive it to zero.
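For illustration, here is a minimal sketch of the pod-level settings that shrink that window. The exact values are assumptions, not what we shipped, and the preStop hook assumes the container image includes a sleep binary:

spec:
  terminationGracePeriodSeconds: 30   # time allowed for in-flight requests to drain after SIGTERM
  containers:
    - name: app                       # hypothetical container name
      lifecycle:
        preStop:
          exec:
            # short pause so the endpoint is removed from Service load balancing
            # before the process begins shutting down
            command: ["sleep", "5"]

Combined with a readiness probe, this narrows the window where traffic can still reach a terminating pod, but it cannot close it entirely.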
The real goal isn’t zero errors. It’s ensuring the error rate is within acceptable thresholds for your SLOs. In this case, 5 errors out of 50,000 requests (0.01%) was far below the acceptable error budget. Instead of chasing perfection, the focus shifted to observability and SLOs: Are we measuring success against clear reliability targets? Do we understand the trade-off between availability and elasticity? Can we demonstrate that the system is resilient enough under real-world conditions?
This reframing turned the issue from “why isn’t it perfect?” into “how do we keep it reliable as it scales?”
The strategic framework
This case study taught me that in cloud-native systems, problems rarely come from one component. More often, they emerge from the interactions between controllers: in this case, ArgoCD enforcing Git state, Kubernetes managing pods, and HPA scaling dynamically.
The key lessons were simple:
- Always define health checks from day one. Without readiness/liveness probes, Kubernetes can't make smart decisions about routing traffic.
- Align GitOps and autoscaling. Use ignoreDifferences to prevent Git from overwriting fields that should be runtime-managed.
- Expect some errors during scaling. Scaling down gracefully is hard, and in-flight requests sometimes get dropped. Focus on reducing the error window, not eliminating it completely.
- Measure success with SLOs. Define an error budget and measure reliability against it.
By the end, the 503 errors were resolved, and the key takeaway was clear: resilience, not perfection, is the goal. Next time you see unexplained errors under load, don’t just look at pods in isolation. Step back and ask: What controllers are shaping this system? Are they aligned, or are they fighting each other? Because sometimes, the biggest bug isn’t in your code. It’s in how your tools work together.