As a platform engineer working with multiple platform teams running applications on the Tanzu Platform for Cloud Foundry, I have been addressing intermittent issues that have been overlooked as quirks of the platform. I like to call these issues developer paper cuts. They are points of friction where something does not work as expected, but so infrequently that investigating it is never prioritized. One of these developer paper cuts is the occurrence of the HTTP 502 response code from services operating on the platform. In this post, I will detail our investigation into this issue, highlight some known causes of the 502 response code, and provide tips for debugging similar issues.
The HTTP 502 Response
Before delving into the investigation, it is essential to understand why a service might return an HTTP 502 status code. The definition of the 502 status code, as outlined in RFC 7231, is quite broad.
“6.6.3. 502 Bad Gateway The 502 (Bad Gateway) status code indicates that the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request.”
– RFC 7231
The naming, Bad Gateway, is misleading. I’ve seen it interpreted as meaning that the gateway itself has failed. In reality, the gateway has received what it perceives to be an invalid response from an upstream service.
The Cause
In a typical request flow for a service on Cloud Foundry, a client sends a request to a load balancer, which forwards it to a Gorouter instance. The Gorouter then forwards the request to an instance of the target service, which may make additional requests to external services such as databases. The service processes any such results and returns its HTTP response to the client, passing back through the Gorouter and the load balancer on the return path.
Within this chain, both the load balancer and the Gorouter act as gateways. The initial step in our investigation is to determine which component is responsible for sending the HTTP 502 response code: the load balancer or the Gorouter.
- If the load balancer sends the HTTP 502, it indicates that the load balancer received an invalid response from the Gorouter. The Gorouter should never send an invalid response regardless of the response received from the upstream application. Any such behavior would signify a bug in the Gorouter.
- If the Gorouter sends the HTTP 502 response, it means that the Gorouter received an invalid response from the upstream service on the platform. It is crucial to identify why the Gorouter deems the service response as invalid.
To quickly ascertain whether the HTTP 502 response originates from the Gorouter, we inspect the application logs. On Cloud Foundry, the application log stream integrates application logs with logs from platform components such as the Gorouter. For example, the command below retrieves logs for the application named drop-con:
cf logs drop-con
An extract of the logs showing lines related to our failing request is shown below.
2024-08-16T11:39:44.28+0100 [APP/PROC/WEB/0] OUT {"time":"2024-08-16T10:39:44.286101777Z","level":"INFO","msg":"dropping connection","from":"10.255.205.26:47996","method":"GET","url":"/drop","traceparent":"","tracestate":"","X-B3-TraceId":"7a632f5f6ccd430352e5a1cbbaa929ed","X-B3-SpanId":"52e5a1cbbaa929ed"}
2024-08-16T11:39:44.28+0100 [RTR/0] OUT drop-con.apps.bolt-2438878.cf-app.com - [2024-08-16T10:39:42.278381365Z] "GET /drop HTTP/1.1" 502 0 67 "-" "HTTPie/3.2.3" "82.36.210.19:54363" "10.0.4.8:61022" x_forwarded_for:"82.36.210.19" x_forwarded_proto:"http" vcap_request_id:"7a632f5f-6ccd-4303-52e5-a1cbbaa929ed" response_time:2.009132 gorouter_time:0.000307 app_id:"6b881b62-849c-4a8a-af39-66270352c774" app_index:"0" instance_id:"bd4a0e3f-567d-4aa8-40cd-ca5a" x_cf_routererror:"endpoint_failure (EOF)" x_b3_traceid:"7a632f5f6ccd430352e5a1cbbaa929ed" x_b3_spanid:"52e5a1cbbaa929ed" x_b3_parentspanid:"-" b3:"7a632f5f6ccd430352e5a1cbbaa929ed-52e5a1cbbaa929ed"
2024-08-16T11:39:59.11+0100 [APP/PROC/WEB/0] OUT {"time":"2024-08-16T10:39:59.112277074Z","level":"INFO","msg":"request","from":"10.255.205.26:56384","method":"GET","url":"/hi","traceparent":"","tracestate":"","X-B3-TraceId":"","X-B3-SpanId":""}
In this log, the Gorouter line (tagged [RTR/0]) records an HTTP 502 status code for the request:
"GET /drop HTTP/1.1" 502
This confirms that the Gorouter sent the 502 because it considered the response from the application invalid. The Gorouter log line also offers further insight into the error:
x_cf_routererror:"endpoint_failure (EOF)"
This field clarifies why the service response was considered invalid. The Gorouter encountered an unexpected EOF (End of File) while reading the response from the application. Here we are assuming “File” is used to represent the response message.
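When many requests are in flight, it can help to extract these fields from the log stream rather than read them by eye. The following is a minimal sketch that assumes the Gorouter access log layout shown in the extract above; the function name `parse_rtr_line` and the regular expressions are illustrative, not part of any Cloud Foundry tooling.

```python
import re

# Matches the quoted request line followed by the status code,
# e.g. "GET /drop HTTP/1.1" 502
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)
# Matches key:"value" pairs such as x_cf_routererror:"endpoint_failure (EOF)".
# Unquoted fields (for example response_time:2.009132) are ignored.
FIELD_RE = re.compile(r'(\w+):"([^"]*)"')


def parse_rtr_line(line):
    """Return request details from a Gorouter access log line, or None.

    Assumption: the line follows the layout shown in the log extract;
    other Gorouter configurations may log different fields.
    """
    if "[RTR/" not in line:
        return None  # not a Gorouter line (for example an APP/PROC/WEB line)
    request = REQUEST_RE.search(line)
    if request is None:
        return None
    fields = dict(FIELD_RE.findall(line))
    return {
        "method": request["method"],
        "path": request["path"],
        "status": int(request["status"]),
        "router_error": fields.get("x_cf_routererror"),
        "request_id": fields.get("vcap_request_id"),
        "app_index": fields.get("app_index"),
    }
```

Filtering the parsed lines for `status == 502` and grouping by `router_error` gives a quick view of which failure messages accompany the 502 responses.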
To validate this assumption, I developed an application to accept an HTTP request and immediately terminate the connection before sending a response. By deploying this intentionally misbehaving application, it became possible to correlate application behavior with the detailed error messages logged by the Gorouter during HTTP 502 events.
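As an illustration of the idea, the sketch below shows one way such a misbehaving service could be written. It is a hypothetical stand-in rather than the application used in the investigation: the function names are invented, and it works at the socket level so that nothing in an HTTP framework can send a response on its behalf.

```python
import socket
import threading


def serve_and_drop(listener, max_connections=1):
    """Accept connections, read the request headers, then close without replying."""
    for _ in range(max_connections):
        conn, _addr = listener.accept()
        with conn:
            conn.settimeout(2.0)
            data = b""
            try:
                # Read until the blank line that ends the request headers.
                while b"\r\n\r\n" not in data:
                    chunk = conn.recv(4096)
                    if not chunk:
                        break
                    data += chunk
            except OSError:
                pass  # timeout or reset: drop the connection either way
            # Leaving the with-block closes the socket before any
            # status line or header has been written.


def start_drop_server(host="127.0.0.1", port=0, max_connections=1):
    """Start the dropping server on a background thread; port 0 picks a free port."""
    listener = socket.create_server((host, port))
    thread = threading.Thread(
        target=serve_and_drop, args=(listener, max_connections), daemon=True
    )
    thread.start()
    return listener, thread
```

A client that connects directly sees the connection close with no bytes returned. When deployed on Cloud Foundry, the listening port would normally be taken from the `PORT` environment variable instead of being chosen at random.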
HTTP 502s on Cloud Foundry
We observed two cases of early response termination that resulted in HTTP 502 response codes from the Gorouter. The error message logged by the Gorouter varies depending on which headers it received from the upstream service.
Example 1 - Service terminates connection with no response:
Client error: 502 Bad Gateway: Registered endpoint failed to handle the request.
Router error: x_cf_routererror:"endpoint_failure (EOF)"
Example 2 - Service sends first response header then terminates connection:
Client error: 502 Bad Gateway: Registered endpoint failed to handle the request.
Router error: endpoint_failure (net/http: HTTP/1.x transport connection broken: unexpected EOF)
The third example highlights a scenario where premature termination of the response message does not result in a 502 status code.
Example 3 - Service sends content-length header then terminates connection:
Client error: http: error: ChunkedEncodingError: ('Connection broken: IncompleteRead(2 bytes read, 1 more expected)', IncompleteRead(2 bytes read, 1 more expected))
Router error: none
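The three examples above amount to a small lookup from the router error to the application behavior that produced it. The sketch below encodes only that mapping; the function name `describe_failure` is hypothetical, and treating a value of "-" as "no error" is an assumption based on how the Gorouter logs the empty x_b3_parentspanid field in the extract earlier.

```python
def describe_failure(status, router_error=None):
    """Map a status code and x_cf_routererror value to the observed app behavior.

    Covers only the three cases described in this post; anything else is
    reported as unclassified rather than guessed at.
    """
    if status == 502 and router_error == "endpoint_failure (EOF)":
        return "connection closed before any response was sent"
    if status == 502 and router_error and "unexpected EOF" in router_error:
        return "connection closed after a partial response header"
    # Assumption: an absent field or "-" means the Gorouter recorded no error.
    if not router_error or router_error == "-":
        return "no router error: check the client for an incomplete body"
    return "unclassified router error"
```

For example, `describe_failure(502, "endpoint_failure (EOF)")` reports that the connection was closed before any response was sent, which is the case seen in the investigation.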
Resolution
In the specific investigation discussed, the consistent occurrence of the x_cf_routererror:"endpoint_failure (EOF)" message alongside HTTP 502 responses allowed for a clear explanation based on application behavior.
The presence of the HTTP 502 message in the logs signifies that the Gorouter believes the application response is invalid. The endpoint_failure (EOF) message indicates that the application accepted the request but terminated the connection before transmitting any data.
One possible cause of early termination is an application crash while processing an inbound request. We ruled this out because further examination showed that the application did not crash and went on to process subsequent requests successfully. Understanding why an application unexpectedly terminates connections would require an analysis of the application code.
Being able to articulate the response in terms of application behavior allowed the team to link the issue to a known problem with Spring-backed applications running on Cloud Foundry. A KB article (298104) outlined potential solutions to the problem.
“A race condition can occur between the Gorouter and a Spring backend application when keep-alives are enabled between the two servers. This race condition results in a 502 response code for the request and there is logs associated with the failed request that read “EOF” in the Gorouter stderr and stdout logs.” – Intermittent 502 EOF Gorouter errors for Spring Apps
In conclusion, investigating one cause of HTTP 502 responses improved our understanding of what might cause HTTP 502 responses in general. We were able to map details in the error messages to specific instances of application behavior.
Recommendations
To aid in investigating similar intermittent issues with applications on a platform, consider the following recommendations:
Log Inbound Requests Before Processing
Initiate the logging of inbound requests before processing them. This practice allows for the immediate confirmation that the application received the inbound request irrespective of the ultimate response. This aids investigation by narrowing the focus to application behaviour.
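One way to apply this is to wrap request handlers so that the log entry is written before any application logic runs. The sketch below is framework-agnostic and makes several assumptions: the handler signature `(method, url, headers)`, the logger name `inbound`, and the decorator name `log_before_processing` are all invented for illustration. The X-B3-TraceId header is logged because it appears in both the application and Gorouter log lines shown earlier, which makes the two easy to correlate.

```python
import functools
import logging

logger = logging.getLogger("inbound")


def log_before_processing(handler):
    """Wrap a request handler so the request is logged before the handler runs.

    The log entry is emitted first, so it exists even if the handler later
    raises, hangs, or drops the connection.
    """
    @functools.wraps(handler)
    def wrapper(method, url, headers):
        logger.info(
            "request received method=%s url=%s trace_id=%s",
            method,
            url,
            headers.get("X-B3-TraceId", ""),
        )
        return handler(method, url, headers)

    return wrapper
```

Logging configuration (level and handlers) is left to the application; with Python's defaults, INFO messages are not emitted until the level is lowered, for example with `logging.basicConfig(level=logging.INFO)`.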
During Investigation State Your Theory and then Validate
When troubleshooting complex systems like application platforms, formulate theories regarding the issue’s cause and validate them through testing. This process of validation can confirm or refine understanding of the system under investigation. In some cases, creating applications that deliberately fail in controlled ways can eliminate the intermittent nature of problems, enabling thorough evaluation of potential causes for HTTP 502 responses and enhancing communication with application teams.
References
The following resources were referenced in this blog post: