Inspecting HTTP Response Headers Without Downloading Body with Guzzle
I recently needed to inspect the HTTP response headers of a very large file download to decide, based on the ETag header, whether we should commit to downloading the file. If the ETag header hadn't changed for a file we had already downloaded, our application could skip the download entirely and use the existing copy. Some of these files are multiple gigabytes in size, so the time savings from this optimization really add up.
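The skip-or-download decision described above can be sketched as a small pure function. This is a minimal sketch, not the code from our application: the name shouldDownload() and the idea of passing in a previously stored ETag are assumptions made for illustration, and the headers are assumed to be in the array shape that PSR-7's getHeaders() returns.

```php
<?php

/**
 * Decide whether a file needs to be downloaded again, based on its ETag.
 * (Hypothetical helper: the name and signature are illustrative only.)
 *
 * @param array<string, string[]> $headers    Response headers, name => list of values.
 * @param string|null             $storedEtag ETag saved from the previous download, if any.
 */
function shouldDownload(array $headers, ?string $storedEtag): bool
{
    // Header names are case-insensitive, so normalise the keys before the lookup.
    $normalised = array_change_key_case($headers, CASE_LOWER);
    $currentEtag = $normalised['etag'][0] ?? null;

    // Without an ETag on both sides we cannot prove the file is unchanged.
    if ($currentEtag === null || $storedEtag === null) {
        return true;
    }

    return $currentEtag !== $storedEtag;
}
```

Falling back to a download whenever either ETag is missing is the conservative choice here: it costs time, but it never leaves us with a stale file.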
The most obvious solution I reached for was to issue a HEAD request to the URL, which returns just the HTTP headers without the response body (and so never actually downloads the file). This didn't work out as well as I expected. Some of the URLs I was working with were signed S3 URLs that were signed only to allow GET requests, and they returned a 403 Forbidden status code for HEAD requests. Additionally, this approach relies on the server on the other end properly implementing HEAD requests, and I'd rather not rely on that being the case for arbitrary URLs.
So, I needed to issue actual GET requests with Guzzle, but somehow avoid downloading the response body. Looking through the documentation, I found the on_headers request option. This seemed promising:
A callable that is invoked when the HTTP headers of the response have been received but the body has not yet begun to download.
That's exactly what I need: some way to inspect the response headers before receiving the response body! It even looks like you can throw an exception inside the on_headers callable to abort the request.
After some experimenting I landed on this getHeaders() function:
    use GuzzleHttp\Exception\RequestException;
    use Psr\Http\Message\ResponseInterface;

    private function getHeaders(string $url): array
    {
        $response = null;

        try {
            $this->guzzle->get($url, [
                'on_headers' => function (ResponseInterface $responseWithOnlyHeaders) use (&$response) {
                    $response = $responseWithOnlyHeaders;
                    throw new BlockResponseBodyDownload();
                },
            ]);
        } catch (RequestException $e) {
            // Swallow only our own abort signal; anything else is a real failure.
            // instanceof is safe even when there is no previous exception.
            if (!($e->getPrevious() instanceof BlockResponseBodyDownload)) {
                throw $e;
            }
        }

        // Have to manually follow redirects when using `on_headers`.
        if (in_array($response->getStatusCode(), [301, 302, 307, 308])) {
            return $this->getHeaders($response->getHeader('Location')[0]);
        }

        return $response->getHeaders();
    }
As you can see, we issue a GET request to the provided URL. In our on_headers callable we receive the HTTP response in $responseWithOnlyHeaders, which we save for later. Then we immediately throw a BlockResponseBodyDownload exception, which aborts the download of the rest of the HTTP response. This exception should be a dedicated one used only for this purpose, because a generic \Exception would be hard to tell apart from Guzzle's native exceptions. I named it BlockResponseBodyDownload to make it clear to the next developer who works on this code what the exception does.
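The exception class itself needs no behaviour of its own. A minimal sketch of how it could be declared follows; extending \RuntimeException is an assumption made for this example, and any exception base class should work equally well as the marker.

```php
<?php

/**
 * Thrown inside the on_headers callable purely to stop the response body
 * from being downloaded. It never signals a real failure.
 * (Sketch: the choice of \RuntimeException as the parent is an assumption.)
 */
final class BlockResponseBodyDownload extends \RuntimeException
{
}
```

Keeping the class empty and final signals that it is a marker and nothing else.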
When you throw an exception in on_headers, Guzzle internally converts it to its own RequestException and passes your exception as the $previous parameter of that RequestException. So to differentiate between our BlockResponseBodyDownload exception and Guzzle's native RequestExceptions, we need to access the previous exception via $e->getPrevious() and check whether it's ours. If it is, we simply ignore it. If it's not, we re-throw it.
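The check on the previous exception can be isolated into a small helper, which also makes it easy to try out without Guzzle. This is a sketch under assumptions: wrapsException() is a hypothetical name, and it accepts any \Throwable so that it can be exercised with ordinary PHP exceptions standing in for RequestException.

```php
<?php

/**
 * Report whether $e merely wraps an exception of the given marker class.
 * (Hypothetical helper, not part of Guzzle.)
 *
 * instanceof is null-safe, so an exception with no previous exception
 * simply yields false. get_class(null), by contrast, throws a TypeError
 * on PHP 8.
 */
function wrapsException(\Throwable $e, string $markerClass): bool
{
    return $e->getPrevious() instanceof $markerClass;
}
```

In the catch block this would read `if (!wrapsException($e, BlockResponseBodyDownload::class)) { throw $e; }`. Because instanceof also matches subclasses, the marker exception can later be subclassed without touching this check.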
The only other caveat to this solution is that Guzzle no longer automatically follows redirects for these requests. As far as I can tell, that's because on_headers also fires for the redirect response itself, so our exception aborts the request before Guzzle gets a chance to follow it. Some of our file download URLs did redirect, so I had to implement redirects manually by checking whether the response had a redirecting status code, and then calling the same function recursively with the URL given in the Location header.
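One thing the recursive approach leaves open is a redirect loop, which would recurse forever. A possible variation is to follow redirects in a loop with a hop limit. The sketch below is not the code we run: resolveHeaders() and the $fetch callable are hypothetical, with $fetch standing in for a headers-only request that returns an array of the form ['status' => int, 'headers' => array].

```php
<?php

/**
 * Follow redirects by hand, giving up after $maxRedirects hops.
 * (Hypothetical helper; $fetch stands in for a headers-only HTTP request.)
 *
 * @param callable $fetch fn(string $url): array{status: int, headers: array<string, string[]>}
 * @return array<string, string[]> Headers of the first non-redirect response.
 */
function resolveHeaders(string $url, callable $fetch, int $maxRedirects = 5): array
{
    for ($hop = 0; $hop <= $maxRedirects; $hop++) {
        $result = $fetch($url);

        // Header names are case-insensitive, so look up Location on lowercased keys.
        $lowercased = array_change_key_case($result['headers'], CASE_LOWER);
        $location = $lowercased['location'][0] ?? null;

        // Anything that is not a redirect with a Location header is the final answer.
        if (!in_array($result['status'], [301, 302, 307, 308], true) || $location === null) {
            return $result['headers'];
        }

        // Like the recursive version, this assumes Location holds an absolute URL.
        $url = $location;
    }

    throw new \RuntimeException("Gave up after {$maxRedirects} redirects");
}
```

Passing the request in as a callable keeps the redirect logic testable with canned responses, with no network involved.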
I figured the easiest way to ensure this works as intended is to simply try to get the headers of a very large file download, and see how long it takes. I wrote this quick test script using a 1GB speed test file from Hetzner.
    $start = microtime(true);

    $headers = $this->getHeaders('https://speed.hetzner.de/1GB.bin');

    echo sprintf('Retrieved ETag header (%s) in %.2F seconds', $headers['ETag'][0], microtime(true) - $start);
Which outputs:
    Retrieved ETag header ("5253f10e-3e800000") in 0.79 seconds
with a very consistent time across multiple test runs. If we comment out our exception in the on_headers callable so that we don't abort the request after getting the headers:
     'on_headers' => function (ResponseInterface $responseWithOnlyHeaders) use (&$response) {
         $response = $responseWithOnlyHeaders;
    -    throw new BlockResponseBodyDownload();
    +    // throw new BlockResponseBodyDownload();
     },
and then re-run the test, we see the time it takes to retrieve the headers skyrocket, because it is fully downloading the 1GB file contained in the HTTP response body:
    Retrieved ETag header ("5253f10e-3e800000") in 207.33 seconds
That's good enough proof for me!