Parsing HTTP Responses in Ruby
Normally, handling HTTP responses in Ruby is rather straightforward. There is a native library in Ruby that handles HTTP requests and parses the responses into a neat data structure that you can then operate on. What if you want to work on stored HTTP responses outside of a connection, though? This was the situation I found myself in, and thanks to a series of unusual decisions in the Ruby core library, I found myself left out in the cold.
For reference, this is in the latest stable Ruby as of this writing (2.5.1).
Let's start with a very small HTTP response stored in a variable for us to test on:
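A minimal sketch of such a response, built from two heredocs. The variable names "body" and "raw_response", the body text, and the "Content-Type" header are illustrative assumptions:

```ruby
# Illustrative body text; rstrip drops the trailing newline the heredoc adds
body = <<~BODY.rstrip
  Hello, world!
BODY

# "\r" is written explicitly; the "\n" comes from the heredoc's own line breaks
raw_response = <<~RESPONSE.rstrip
  HTTP/1.1 200 OK\r
  Content-Type: text/plain\r
  Content-Length: #{body.bytesize}\r
  \r
  #{body}
RESPONSE

puts raw_response.inspect
```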
The above is a little bit weird but is a minimal reasonable HTTP response. All lines are appropriately terminated with both a carriage return (explicit) and a newline (implicit in how the strings are defined). The "Content-Length" header is the exact number of bytes present in the body (thus the two "#rstrip" calls). The "Date" header was omitted due to this line in RFC 7231:
An origin server MUST NOT send a Date header field if it does not have a clock capable of providing a reasonable approximation of the current instance in Coordinated Universal Time.
...Which the content of this static site does not have.
With our minimal response out of the way, how do we go about parsing it? The Ruby 2.5.1 stdlib documentation for "Net::HTTPResponse" doesn't specify how it can be created by end users, which usually means it isn't intended for direct use by users of the language, and digging through the Ruby source you'll see this is precisely the case. That means Ruby does not have an HTTP response parser available in its standard library. This is pretty frustrating, but maybe it can be worked around.
How does the "Net::HTTP" library make use of it? Even if the methods aren't listed in the public documentation, they're still public methods on the class and should be usable without monkey patching, right? The response is set up inside "Net::HTTP" (the socket is opened in "#connect" and the response is read in "#transport_request") and it comes down to a few relevant lines that can be summarized as:
- Open a socket to the web server
- Write the formatted request to the socket
- Pass the socket to "HTTPResponse.read_new" (a class method)
So we need a socket-like object containing our response, which we can build with "StringIO" and pass to the appropriate method. Let's see what happens:
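A sketch of that attempt. The "raw_response" string is an illustrative stand-in for the stored response, and the error is rescued only so the message can be inspected:

```ruby
require 'net/http'
require 'stringio'

# Illustrative stand-in for the stored response
raw_response = "HTTP/1.1 200 OK\r\n" \
               "Content-Length: 13\r\n" \
               "\r\n" \
               "Hello, world!"

begin
  # read_new is undocumented, but it is still a public class method
  Net::HTTPResponse.read_new(StringIO.new(raw_response))
  outcome = :parsed
rescue Net::HTTPBadResponse => e
  outcome = e.message
end

puts outcome
```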
We get a raised exception: a "Net::HTTPBadResponse" complaining about a wrong status line.
That is definitely a valid status line, so what is going on here? Back to Ruby's source code... "Net::HTTPResponse.read_new" starts off by calling "Net::HTTPResponse.read_status_line", which checks the line against a regex (one carrying an "/n" modifier) to extract and validate the status line.
I had never seen the "/n" modifier for Ruby's regular expressions and it seems to be completely undocumented. This turned out to be a red herring, as it simply sets Regexp::NOENCODING (I had to dig into the spec/ruby/core/regexp/options_spec.rb file to figure that one out).
So why isn't that regular expression matching? Spoiler: It's the newline (the carriage return is fine). That is a violation of the HTTP spec, but it is working normally for Ruby's HTTP requests so what gives? Apparently we have to go deeper...
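Both points can be checked in isolation. The pattern below is the status line regex as I remember it from the 2.5-era net/http source, so treat its exact text as an assumption; the status lines are illustrative:

```ruby
# Status line pattern, quoted from memory of the 2.5-era net/http source
status_line = /\AHTTP(?:\/(\d+\.\d+))?\s+(\d\d\d)(?:\s+(.*))?\z/in

# The /n modifier only flips the NOENCODING option bit
no_encoding = (status_line.options & Regexp::NOENCODING) != 0

# "." never matches "\n", and \z insists on the very end of the string
with_newline    = status_line.match("HTTP/1.1 200 OK\r\n")
# A lone "\r" is swallowed by "(.*)" and ends up in the reason phrase
without_newline = status_line.match("HTTP/1.1 200 OK\r")

puts no_encoding
puts with_newline.inspect
puts without_newline.captures.inspect
```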
It's getting the header string by calling #readline, which on standard IO objects returns the line with its trailing newline intact (StringIO mirrors the interface of the IO class, which is the base for Socket objects in addition to many others). In Ruby 2.4 and later there is a chomp flag that changes this behavior, but it isn't being used in this case, and it would take the carriage return with it if it were.
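A quick sketch of that #readline behavior on a StringIO, using illustrative header lines. The second call uses the chomp flag mentioned above; its output is printed rather than asserted:

```ruby
require 'stringio'

io = StringIO.new("HTTP/1.1 200 OK\r\nContent-Length: 13\r\n")

# Default behavior: the line ending is part of the returned string
first_line = io.readline
puts first_line.inspect

# With the chomp flag (Ruby 2.4+) the line ending is removed instead
second_line = io.readline(chomp: true)
puts second_line.inspect
```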
So... we must not be operating on an actual IO subclass... And sure enough, after getting the raw socket, Net::HTTP#connect wraps it in a Net::BufferedIO object, which is another internal hidden class. It is defined in lib/net/protocol.rb, and its #readline method reads up to the newline and then chops the line ending off.
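From my reading of the 2.5-era lib/net/protocol.rb, the method boils down to readuntil("\n").chop, and String#chop drops a trailing "\r\n" as a single unit. A small sketch of the difference, using an illustrative response:

```ruby
require 'net/http' # loads net/protocol, where Net::BufferedIO lives
require 'stringio'

raw = "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"

# StringIO follows the usual convention and keeps the line ending
plain = StringIO.new(raw).readline

# Net::BufferedIO removes it before handing the line back
buffered = Net::BufferedIO.new(StringIO.new(raw)).readline

puts plain.inspect
puts buffered.inspect
puts "a\r\n".chop.inspect # String#chop removes "\r\n" together
```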
Yep, for some reason this one private internal API has decided to complicate a Ruby standard API convention and strip off the trailing carriage return and newline. Wrapping our StringIO object in a BufferedIO object does solve this problem, but there is no reason for these complications...
Or does it?
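Not quite. A sketch with an illustrative response: the status line and headers now parse, but asking for the body still fails because the response is no longer attached to anything it can read from. Which exception class you get appears to depend on the Ruby version, so the sketch rescues broadly:

```ruby
require 'net/http'
require 'stringio'

# Illustrative stand-in for the stored response
raw_response = "HTTP/1.1 200 OK\r\n" \
               "Content-Length: 13\r\n" \
               "\r\n" \
               "Hello, world!"

io = Net::BufferedIO.new(StringIO.new(raw_response))
response = Net::HTTPResponse.read_new(io)

puts response.code              # status line parsed
puts response['Content-Length'] # headers parsed

begin
  response.body
  body_readable = true
rescue StandardError => e
  body_readable = false
  puts "#{e.class}: #{e.message}"
end
```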
We need to pull one more trick from Net::HTTP#transport_request to get the body: its call to #reading_body. That call actually returns the body, but we want to treat this like a normal HTTPResponse, so we want to make sure the #body method works:
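A sketch of that trick with an illustrative response. The second argument to #reading_body says whether the request allowed a response body; there is no request here, so passing true is an assumption:

```ruby
require 'net/http'
require 'stringio'

# Illustrative stand-in for the stored response
raw_response = "HTTP/1.1 200 OK\r\n" \
               "Content-Length: 13\r\n" \
               "\r\n" \
               "Hello, world!"

io = Net::BufferedIO.new(StringIO.new(raw_response))
response = Net::HTTPResponse.read_new(io)

# Attach the IO, read the body, detach again; the block can stay empty
returned = response.reading_body(io, true) {}

puts returned.inspect
puts response.body.inspect # now cached on the response
```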
There are still a couple of differences from a normal response. The only one of particular note to me is that normally the response gets its #uri data from the request. This isn't available with the response alone but can be set pretty easily:
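For example, assuming the stored response originally came from the hypothetical URL below:

```ruby
require 'net/http'
require 'stringio'
require 'uri'

# Illustrative empty response, parsed as before
raw_response = "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
response = Net::HTTPResponse.read_new(Net::BufferedIO.new(StringIO.new(raw_response)))

# Hypothetical origin of the stored response
response.uri = URI.parse('http://example.com/')

puts response.uri
```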
Altogether this is what it looks like:
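A sketch that gathers the steps into one place. The helper name "parse_http_response", its "uri:" keyword, and the sample response are my own illustrative choices, and everything leans on undocumented internals, so it may break between Ruby versions:

```ruby
require 'net/http'
require 'stringio'
require 'uri'

# Hypothetical helper: turn a raw HTTP response string into a Net::HTTPResponse.
# Relies on undocumented internals (Net::BufferedIO, read_new, reading_body).
def parse_http_response(raw, uri: nil)
  io = Net::BufferedIO.new(StringIO.new(raw))
  response = Net::HTTPResponse.read_new(io) # status line and headers
  response.reading_body(io, true) {}        # read and cache the body
  response.uri = URI.parse(uri) if uri      # normally copied from the request
  response
end

body = <<~BODY.rstrip
  Hello, world!
BODY

raw_response = <<~RESPONSE.rstrip
  HTTP/1.1 200 OK\r
  Content-Type: text/plain\r
  Content-Length: #{body.bytesize}\r
  \r
  #{body}
RESPONSE

response = parse_http_response(raw_response, uri: 'http://example.com/')
puts response.code
puts response.body
puts response.uri
```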
You now have a valid Net::HTTPResponse object.