After running a few projects with plenty of microservices in GKE, we noticed that we collect a lot of logs about health checks. In most cases, those logs did not say anything interesting: just HTTP 200 and that was it. What caught my attention was how many types of health checks exist in GCP. Is there a way to avoid this number of health checks? Could we make it work with K8s probes alone? Let’s find out.

Let’s establish what type of health checks we have in GCP:

GCE health checks defined at the load balancer level

Those can be divided by the target they are measuring:

a) Instance groups – the original option for compute instances running K8s services on a particular port. This approach is obsolete and easily avoidable for K8s use cases. Ackee currently uses it for non-K8s workloads, e.g. an Elasticsearch cluster where each instance runs the Elasticsearch service.

Before the introduction of Network Endpoint Groups, each node in GKE had to expose a fixed port for the K8s service; that was done with a NodePort. The health check probed each node of the cluster in the load balancer’s Instance Group. Since every node exposed the port for the service, all the nodes were in the Instance Group – no matter whether they were running the pods with the app or not.

If a node broke somehow, it was removed from the Instance Group until it was fixed again. In this scenario, the GCE health check makes perfect sense and its use cannot be questioned. GCE health checks are blind to the internal K8s workload, but back then you had no choice: you always had to trust K8s internal traffic management to direct traffic to the correct nodes.

b) Network Endpoint Groups (or NEGs) – the “new” (introduced a few years ago) way to forward traffic in GCE. Services do not need to run on Compute instances; an endpoint composed of an IP address and a port is enough. Each service can advertise its presence in GCE as a Network Endpoint labeled as <IP>:<port>.

This has a significant impact on TTFB: traffic no longer needs to flow through nodes without any running pods, it can be directed straight to the pods.
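
As a hedged sketch: in GKE, NEGs are enabled per Service through the cloud.google.com/neg annotation. With the Terraform Kubernetes provider (resource names, labels, and ports below are illustrative), that could look like:

```hcl
# Illustrative Service enabling NEG creation for the pods it selects.
# The "cloud.google.com/neg" annotation tells GKE to create a Network
# Endpoint Group with one <IP>:<port> endpoint per ready pod.
resource "kubernetes_service" "api" {
  metadata {
    name = "api"
    annotations = {
      "cloud.google.com/neg" = jsonencode({ ingress = true })
    }
  }
  spec {
    selector = {
      app = "api"
    }
    port {
      port        = 80
      target_port = 3000
    }
  }
}
```

GKE then maintains one network endpoint per ready pod, and the load balancer targets those endpoints directly instead of going through NodePorts.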

At this point, I was tempted to claim that only the Network Endpoint Group as a whole receives GCE health checks, since it is represented as one backend in the load balancer setup. That would mean that if I had two pods and deleted one, the number of health checks from GCE would stay the same. That is not true, as the following histogram shows:

Each pod receives its own health check with the same interval defined in the GCE health check setup.

Note: These health checks can be found in the Logs Explorer with a query for “GoogleHC/1.0”.

K8s “health checks”

There is nothing strange about those: they evaluate a pod’s fitness. For further details, check the documentation. Also, strictly speaking, K8s is not checking health, it is probing it.

The real difference is in whether the traffic is directed to the pods or not. Let’s read the documentation about readiness probes:

 > Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup or depend on external services after startup. In such cases, you don’t want to kill the application, but you don’t want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.

Note: You can see the probes in the logs with a query containing “kube-probe/1.21”.

That means we should provide readiness probes to K8s and move on, right? But what about GCE health checks? Can we just disable them or set the interval to minutes and move on?

From my empirical observation while monitoring: once a K8s pod has a load issue, it stops responding to the readiness probes and is temporarily removed from the Service until it starts responding again. Once the pod is not used in the Service, GCE health checks mark it as unhealthy and it is removed from the NEG and the GCE load balancer.

This means you can’t just disable one in favor of the other; if everything runs correctly, you simply need to answer both kinds of health checks.

Setting the correct interval

In this case, we also have to differentiate between GCE health checks and K8s probes. From F5 support:

 > For any given service, set the health check to a time that reflects the maximum delay in response that is acceptable for customer access.

Let’s say you have already defined an SLO with your customer and you are obligated to meet:

  • 2400 ms of latency in the worst case
  • 99.9% uptime, which means we can afford about 43 minutes of downtime a month
  • An error rate of at most 1%, which can mean, say, 10k errors a month

This suggests setting health checks to an interval of around 3 seconds. That poses multiple issues:

  • Setting health checks that frequently could lead to a few failed checks at the beginning of the runtime
    • The first thought might be that this would manifest in GCE health checks, but those apply only to services passing readiness probes, and readiness probes include initialDelaySeconds to avoid this issue
  • For our microservices, health checks are small integration tests due to many issues we had in the past (Pub/Sub, SQL proxy, …); running them too often may cause load issues

There might be a compromise, so I googled some more. From a Reddit post:

 > To address #1, it is best to do an end-to-end test at startup to be sure new releases are good. It should be as lightweight as possible but hit the primary subsystems.

 > To address #2 you need the endpoint to do a quick check on the value of a flag variable and return healthy or not based on that value. You keep that variable up to date in your error handling. If you start getting too many errors (type of error will vary based on the app) you flip the flag to unhealthy. You can optionally flip it back based on other criteria.

That sounds like a reasonable compromise between the initial runtime and the long lifetime. But it would be a bit harder to implement. Also, the health variable would be challenging to track in retrospect unless you log its content on every change.

I would rather keep it as simple as possible. Naively, we can suggest setting the readiness probe and the GCE health check to the same interval.

Regarding the intervals: F5 also suggests a 3:1 ratio rule for health check timeouts:

 > An interval is measured in seconds. As a rule, the timeout interval should be 3 times the frequency, plus one second.

We also know that for services behind a load balancer, we have only 30 seconds to deliver the response before the connection is closed.

For that, I would also like to divide the final intervals into two options:

App serving content to the client

K8s probes:

  • Initial readiness & liveness initialDelaySeconds should reflect the application’s needs; for Node.js apps, this interval is largely irrelevant (startup is usually fast)
  • Readiness periodSeconds should reflect the SLA; generally, 3–5 seconds seems like a good interval for almost any service
  • Readiness timeoutSeconds should be 3 × periodSeconds + 1 second -> 10–16 seconds
    • It could also go up to 30 seconds because, after that, data is not delivered through the load balancer anyway
  • Liveness doesn’t seem to be important for traffic forwarding; it should rather reflect system requirements (Pub/Sub weirdly disconnected, memory running out, …)
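
Put together, the probe values above might look as follows in a Deployment managed through the Terraform Kubernetes provider (names, image, port, and path are illustrative):

```hcl
# Sketch of the suggested probe settings for a client-facing Node.js service.
resource "kubernetes_deployment" "api" {
  metadata {
    name = "api"
  }
  spec {
    selector {
      match_labels = {
        app = "api"
      }
    }
    template {
      metadata {
        labels = {
          app = "api"
        }
      }
      spec {
        container {
          name  = "api"
          image = "gcr.io/my-project/api:latest"

          readiness_probe {
            http_get {
              path = "/healthz"
              port = 3000
            }
            initial_delay_seconds = 5  # Node.js startup is usually fast
            period_seconds        = 5  # reflects the SLA discussed above
            timeout_seconds       = 16 # 3 x period + 1 second
          }

          liveness_probe {
            http_get {
              path = "/healthz"
              port = 3000
            }
            period_seconds  = 30 # reflects system requirements, not traffic
            timeout_seconds = 10
          }
        }
      }
    }
  }
}
```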

GCE health check:

  • GCE health check should reflect load balancer timeout and readiness probes, therefore:
    • an interval of 3–5 seconds & a timeout of 10–16 seconds should be reasonable
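
A matching GCE health check could be sketched in Terraform as follows (names, port, and path are illustrative). One caveat: to my knowledge the GCE API rejects a timeout_sec greater than check_interval_sec, so unlike with the K8s probes, the timeout here cannot exceed the interval:

```hcl
# Illustrative GCE health check matching the readiness probe cadence.
# Note: GCE requires timeout_sec <= check_interval_sec, so the 10-16 s
# timeout used for the K8s probes cannot be carried over directly.
resource "google_compute_health_check" "api" {
  name                = "api-hc"
  check_interval_sec  = 5
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 3

  http_health_check {
    port         = 3000
    request_path = "/healthz"
  }
}
```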

App consuming the queue

In this case, both readiness & liveness probes should reflect the time needed to process a message from the queue. Since we are not delivering any messages through the load balancer, I would set the values as high as reasonable:

  • The liveness probe should be set to the longest time required to process an item from the queue; let’s say a minute should be enough
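
A hedged sketch of a queue consumer with such a relaxed liveness probe (names, image, and endpoint are illustrative):

```hcl
# Sketch: liveness probe for a queue consumer. No load balancer is
# involved, so the period only needs to catch a stuck worker.
resource "kubernetes_deployment" "worker" {
  metadata {
    name = "worker"
  }
  spec {
    selector {
      match_labels = {
        app = "worker"
      }
    }
    template {
      metadata {
        labels = {
          app = "worker"
        }
      }
      spec {
        container {
          name  = "worker"
          image = "gcr.io/my-project/worker:latest"

          liveness_probe {
            http_get {
              path = "/healthz"
              port = 3000
            }
            initial_delay_seconds = 10
            period_seconds        = 60 # ~ longest expected message processing time
            timeout_seconds       = 60
          }
        }
      }
    }
  }
}
```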

GCP health checks: reflection

Did I answer the question of what to do with so many health checks? Well, probably not. If anything, I have shown that having a lot of them can be reasonable. I wanted this blog post to describe how complicated GCP health checks are, and since this text is already long and a bit convoluted, I think I proved my point.

If there is anything important I would like you to take away from the blog post, it could be the note from F5 documentation:

> For any given service, set the health check to a time that reflects the maximum delay in response that is acceptable for customer access.

I reflected on that by drawing a clear relationship between health checks and the SLO. Whenever you wonder what interval and frequency you should choose, check your SLO documentation – the answer is right there.

Also, logging correct responses to health checks might be unnecessary. In the end, we decided to filter the logs and keep only failed health checks. That helps us save a few schmeckles. You can do that with the google_logging_project_sink resource in Terraform:

resource "google_logging_project_sink" "default_sink" {
  destination = "logging.googleapis.com/projects/…/locations/global/buckets/_Default"
  disabled    = false
...
  exclusions {
    disabled = false
    filter   = <<-EOT
        severity=INFO AND (
          jsonPayload.user_agent=~"GoogleHC/.*" OR
          jsonPayload.user_agent=~"kube-probe/.*" OR
          jsonPayload.req.headers."user-agent"=~"GoogleHC/.*" OR
          jsonPayload.req.headers."user-agent"=~"kube-probe/.*" OR
          jsonPayload.userAgent="GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)" OR
          httpRequest.userAgent="GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)"
        )
        EOT
    name     = "ExcludeHC"
  }
}

Only health checks with a severity other than INFO will be present in the logs, which helps us investigate any issues we might have.

Hopefully you enjoyed reading this post and it helps you decide which interval is best for you. If you have found better reading on the matter, please, I beg you, write a comment.
