This article is part of the series on the rust-http redesign, Teepee.
Header representation is a critical matter to Teepee’s design: it is uncompromisingly strongly typed, but there must be tradeoffs. After trying quite a few different schemes at length, I have settled upon quite a novel scheme which I believe to optimally balance all considerations.
The current situation
Here’s how rust-http currently handles headers:
… and then the header collection is placed into the request or response struct as the headers field. Thus the value of the Content-Length header of a response is response.headers.content_length and is of type Option<uint>. Note also that the request and response have different header collections, as there are various headers appropriate only to a request or response.
At present, headers are parsed proactively; valid headers are placed in a Some value and invalid or missing values are stored as None.
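A field‐per‐header collection of this kind can be sketched roughly as follows. This is an illustration in present‐day Rust rather than rust-http's actual code: the struct name ResponseHeaders, the from_raw constructor and the choice of two fields are assumptions, and u64 stands in for the uint of the time.

```rust
/// Hypothetical, cut-down response header collection: one field per known header.
#[derive(Debug, Default, PartialEq)]
pub struct ResponseHeaders {
    pub content_length: Option<u64>,
    pub server: Option<String>,
}

impl ResponseHeaders {
    /// Parse every field eagerly. An invalid value ends up as `None`,
    /// which makes it indistinguishable from a missing one.
    pub fn from_raw(fields: &[(&str, &str)]) -> ResponseHeaders {
        let mut headers = ResponseHeaders::default();
        for (name, value) in fields {
            if name.eq_ignore_ascii_case("content-length") {
                headers.content_length = value.trim().parse::<u64>().ok();
            } else if name.eq_ignore_ascii_case("server") {
                headers.server = Some(value.trim().to_string());
            }
            // Headers without a field are silently ignored in this sketch.
        }
        headers
    }
}
```

Note how a malformed Content-Length leaves no trace at all, which is the behaviour the criteria below set out to change.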
Criteria
There are a few criteria that I considered for header design in Teepee:
Performance must be “good”.
Ergonomics for using standardised headers must be “good”.
Ergonomics for using extension headers must be “good”.
Extension headers should be strongly typed.
Badly formatted headers must be accessible in some way.
A shared access technique for both standardised and extension headers is desirable (for backwards compatibility as more headers are added to the Teepee libraries), especially if it allows headers to be used as multiple types.
A sample of schemes tried
Bad header field value handling
For various reasons, it is necessary that invalid headers be able to be accessed, though normally an illegal value should be dropped. This was not done at all in rust-http, but is essential in Teepee. (For more fun, there are certain things to be aware of in parsing headers like that there may be multiple fields with the same name, but only if they would be equivalent to a single comma-separated list, with the difference that any illegal items should be dropped, or that in some cases part of a field may be invalid and be dropped while the rest is interpreted.)
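As a rough illustration of the list rule mentioned in that parenthesis, the sketch below treats several field values of the same name as one comma‐separated list, keeping the items that parse and dropping the ones that don't. The function name parse_list is an assumption for illustration, and the sketch ignores quoted strings, which a real parser would have to respect.

```rust
use std::str::FromStr;

/// Treat several field values with the same name as one comma-separated
/// list, dropping empty and unparseable items instead of failing outright.
pub fn parse_list<T: FromStr>(field_values: &[&[u8]]) -> Vec<T> {
    field_values
        .iter()
        // A value that is not valid UTF-8 is dropped as a whole.
        .filter_map(|raw| std::str::from_utf8(raw).ok())
        // Each remaining value contributes its comma-separated items.
        .flat_map(|value| value.split(','))
        // Items that fail to parse (including empty ones) are dropped.
        .filter_map(|item| item.trim().parse().ok())
        .collect()
}
```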
Use an Option<Result<T, Vec<u8>>> (or similar) type instead of Option<T>. Rejected outright (a) for bad ergonomics, as almost no one can do anything with malformed values, so normal code must not be burdened with them; and (b) because partially bad headers would be lost.
Expose the unparsed values through callbacks that can be injected, or a conditions‐like scheme (minus the fail‐by‐default behaviour). Rejected because while the raw values can be accessed, doing so will be excessively difficult, leading to complaints when people do actually want it, as the layer or framework they are using may not have exposed it in an accessible way.
Store raw headers alongside the main place; for example, given request.headers, have request.raw_headers or request.headers.raw as a HashMap<Vec<u8>, Vec<Vec<u8>>> (that is, a mapping of field names to the values for that field) or a Vec<(Vec<u8>, Vec<u8>)> (that is, all of the header fields, completely raw). (Try saying “vec” ten times rapidly.) This isn’t ideal for the overhead, as it requires two heap allocations for each header field (the lowest level will still operate without allocations there; this is purely a feature of the high level) and must still store each header in raw form.
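To make the third option concrete, here is a minimal sketch of raw values kept next to the typed ones. The Headers struct, its content_length field and the insert_raw method are hypothetical names for illustration only; the point is the shape of the raw map and where the extra allocations come from.

```rust
use std::collections::HashMap;

/// Hypothetical collection: typed fields plus every field value in raw form.
#[derive(Debug, Default)]
pub struct Headers {
    pub content_length: Option<u64>,
    /// Lower-cased field name -> all raw values seen for that name.
    pub raw: HashMap<Vec<u8>, Vec<Vec<u8>>>,
}

impl Headers {
    pub fn insert_raw(&mut self, name: &[u8], value: &[u8]) {
        // Allocates a lower-cased copy of the name...
        let key = name.to_ascii_lowercase();
        if key == b"content-length" {
            // Typed view: malformed values simply become `None`.
            self.content_length = std::str::from_utf8(value)
                .ok()
                .and_then(|s| s.trim().parse::<u64>().ok());
        }
        // ...and a copy of the value, which is kept whether or not it parsed.
        self.raw.entry(key).or_default().push(value.to_vec());
    }
}
```

With this shape a malformed value is lost from the typed field but remains reachable through raw.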
Of the options shown here, the last is probably the best. I have one further objection to it, though: it makes it too easy to access those raw values. I’m afraid that if I put that there, people might use it (!), and I don’t want that. (Yes, I’m stubborn about all this; I want to give people what’s good for them, not necessarily what they want. We’ll find out if I’m right in the years to come.) This can be remedied by having the raw fields private and only exposed through unsafe functions.
There is one other scheme, building upon the concepts of the third option, where the handling of bad header fields falls quite naturally out of the solution for header storage; I omit that here as it will be expanded upon later.
Header storage
I shall not bore you with the full details, though I have plenty more details should you be interested; here is a smattering of some of the things I’ve considered:
Fields for standard headers, a Map<StrBuf, StrBuf> for the rest: (a.k.a. “leave it as it is”) great performance, great ergonomics for standard headers, but extension headers are very much second class citizens, being weakly typed.
Fields for known headers, drop (or Map) unknown headers, multiple inheritance edition: very cool and perfect in almost every way… except the wait for a feature that will probably never be implemented.
Fields for known headers, drop (or Map) unknown headers, say‐what‐you‐want edition: (require people to declare the headers they expect to deal with and their types, producing a struct for each case). Great performance, poor ergonomics of declaration (especially for its unfamiliarity), useless for code reuse.
The other thing I’m about to talk about: good enough performance, good ergonomics for all headers, treats all headers alike; flawless with regards to backwards compatibility; allows read and write access to raw headers where necessary.
A combination of fields for standard headers with that thing I’m about to talk about for the rest: gets better performance for standard headers, but at the cost of consistency, future‐proofing and raw access to standard headers. (In short, a cute idea but not worth it.)
The scheme I have settled on
This scheme is not perfect, but weighing up the advantages and disadvantages I am fairly confident that it is close to optimal for Rust.
Here are some examples of the API we end up with. We get nice and easy typed access:
(get() necessarily takes &mut self, so you can’t borrow more than one header at a time; that’s a large part of why get is there, which clones the object before returning it.)
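The borrowing constraint can be seen in a small sketch. LazyHeaders, get_ref and the u64‐only values are hypothetical simplifications, not the Teepee API: the reference‐returning accessor has to cache the parsed value, so it takes &mut self and its result keeps the collection exclusively borrowed, while the cloning accessor hands back an owned value.

```rust
use std::collections::HashMap;

/// Hypothetical collection that parses `u64` values on first access.
pub struct LazyHeaders {
    raw: HashMap<String, String>,
    parsed: HashMap<String, u64>,
}

impl LazyHeaders {
    pub fn new(fields: &[(&str, &str)]) -> Self {
        let raw = fields
            .iter()
            .map(|(name, value)| (name.to_string(), value.to_string()))
            .collect();
        LazyHeaders { raw, parsed: HashMap::new() }
    }

    /// Caching the parsed value mutates `self`, hence `&mut self`; the
    /// returned reference keeps that exclusive borrow alive.
    pub fn get_ref(&mut self, name: &str) -> Option<&u64> {
        if !self.parsed.contains_key(name) {
            let value: u64 = self.raw.get(name)?.trim().parse().ok()?;
            self.parsed.insert(name.to_string(), value);
        }
        self.parsed.get(name)
    }

    /// Cloning ends the borrow, so several headers can be held at once.
    pub fn get(&mut self, name: &str) -> Option<u64> {
        self.get_ref(name).cloned()
    }
}
```

Two results of get can be held side by side, whereas holding two results of get_ref at once is rejected by the borrow checker.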
It would be possible to expose raw values via unsafe functions…
But that’s actually not necessary! We can use get and set with a different argument and, from that, a different return type:
(This will be slower than exposing raw values directly, as it is treating Vec<Vec<u8>> as a strong header type and so incurring an extra conversion; therefore in practice extra unsafe methods will probably be provided for raw access. But it demonstrates the principle of the thing.)
And the good part—it’s in there, so now we can access it as another strong type:
How is this done? Internally it maintains a hash map of values which may be either raw or Box<Header> (a trait object); initially everything is raw, but as you access them they are parsed into the appropriate header type and stored; if you try to access a value as a different type, it is converted to the HTTP wire format and then parsed as the new type, and stored if that succeeded.
All this is pretty magic, and something that you couldn’t easily achieve in most languages, especially with regard to the type inference, but it’s something that can be achieved in Rust in a very convenient and completely safe way! (OK, so it takes just a smidgeon of unsafe code, implementing the Any*Ext traits, which is just a straight copy of the implementations for &Any et al. But the interface it exposes is absolutely solid and crash‐proof.)
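To give a feel for the mechanism, here is a minimal sketch in present‐day Rust. It is not the Teepee implementation: the names Header, Stored, Headers, set_raw and ContentLength are made up for the illustration, header names are passed as strings rather than as typed markers, each field holds a single raw value (so Vec<u8> stands in for Vec<Vec<u8>>), and get returns a clone rather than a reference. It also avoids the unsafe code by going through an as_any method.

```rust
use std::any::Any;
use std::collections::HashMap;

/// A strongly typed header value (hypothetical trait for this sketch).
pub trait Header: Any + Clone {
    fn parse(raw: &[u8]) -> Option<Self>;
    fn to_wire(&self) -> Vec<u8>;
}

/// Object-safe view of a typed header, so mixed types can share one map.
trait Stored {
    fn wire(&self) -> Vec<u8>;
    fn as_any(&self) -> &dyn Any;
}

impl<H: Header> Stored for H {
    fn wire(&self) -> Vec<u8> { self.to_wire() }
    fn as_any(&self) -> &dyn Any { self }
}

/// Each entry is either still raw or already parsed into some type.
enum Item {
    Raw(Vec<u8>),
    Typed(Box<dyn Stored>),
}

#[derive(Default)]
pub struct Headers {
    items: HashMap<String, Item>,
}

impl Headers {
    pub fn set_raw(&mut self, name: &str, value: &[u8]) {
        self.items.insert(name.to_ascii_lowercase(), Item::Raw(value.to_vec()));
    }

    pub fn set<H: Header>(&mut self, name: &str, value: H) {
        self.items.insert(name.to_ascii_lowercase(), Item::Typed(Box::new(value)));
    }

    /// Return the header as `H`, parsing or converting it if necessary.
    pub fn get<H: Header>(&mut self, name: &str) -> Option<H> {
        let key = name.to_ascii_lowercase();
        let wire = match self.items.get(&key)? {
            Item::Typed(stored) => match stored.as_any().downcast_ref::<H>() {
                Some(h) => return Some(h.clone()), // already stored as `H`
                None => stored.wire(),             // other type: go via wire format
            },
            Item::Raw(bytes) => bytes.clone(),
        };
        // If parsing fails, the stored item is left untouched.
        let parsed = H::parse(&wire)?;
        self.items.insert(key, Item::Typed(Box::new(parsed.clone())));
        Some(parsed)
    }
}

/// Example typed header.
#[derive(Clone, Debug, PartialEq)]
pub struct ContentLength(pub u64);

impl Header for ContentLength {
    fn parse(raw: &[u8]) -> Option<Self> {
        std::str::from_utf8(raw).ok()?.trim().parse::<u64>().ok().map(ContentLength)
    }
    fn to_wire(&self) -> Vec<u8> { self.0.to_string().into_bytes() }
}

/// Raw access expressed as just another header type.
impl Header for Vec<u8> {
    fn parse(raw: &[u8]) -> Option<Self> { Some(raw.to_vec()) }
    fn to_wire(&self) -> Vec<u8> { self.clone() }
}
```

In this sketch, headers.get::<ContentLength>("content-length") parses and caches the typed value, and a later headers.get::<Vec<u8>>("content-length") converts it back through the wire format. One design choice here that the description above does not dictate: a failed conversion leaves the existing entry in place rather than discarding it.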
A proof‐of‐concept implementation
I could yammer on and on about how it works, but I don’t think it’s necessary. Here’s my proof of concept in all its half‐baked glory:
Assessment
Here’s how this goes on the criteria listed:
Performance must be “good”.
The good news: headers are parsed lazily. The bad news: two or three allocations per header field, plus an allocation and a deallocation for every type change (including that lazy parsing). Performance is, I think, the weakest part of this scheme, but I don’t believe anything faster is possible while still ticking all the other boxes.
Ergonomics for using standardised headers must be “good”.
There’s a certain perfection about field access… but this will do dandy.
Ergonomics for using extension headers must be “good”.
As above.
Extension headers should be strongly typed.
Got it.
Badly formatted headers must be accessible in some way.
Readily accessible, though accessing a header as a strongly typed value will destroy the raw stuff, so anyone caring about such things will need to be careful.
A shared access technique for both standardised and extension headers is desirable (for backwards compatibility as more headers are added to the Teepee libraries), especially if it allows headers to be used as multiple types.
This solution absolutely shines at this: if two libraries want to access the same header as a different type, it will happily convert between them, just sacrificing a little bit of performance for the benefit.
Unless anyone finds any significant problems with this scheme, or a better scheme, Teepee will be using this type of header collection.
Remaining questions
Should access to the raw values be granted via unsafe functions, or should typed access be genuinely the only interface exposed, leaving an implementation for Vec<Vec<u8>> as the way of accessing raw values? (Side note: without that, the library will accept multiple header fields with the same name, but will merge them separated by a comma when writing them. This is perfectly spec‐compliant behaviour, but it is conceivable that someone may wish to genuinely write multiple header fields with the same name and not merge them. I guess I probably just talked myself into answering this question. Still, without such behaviour, the header collections for writing could skip the possibility of a raw value and just use a Box<Header> directly instead of Item, which would produce moderately trivial performance improvements. So I guess the question is still open. Can anyone think of a way that one could use the std::fmt interface and yet still allow the writing of multiple header fields from the one value?)
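For reference, the merging behaviour described in the side note might look like the sketch below, where a list‐valued header is written through std::fmt::Display as one comma‐separated field. The Accept struct and the write_field helper are hypothetical. A single Display implementation writes into one formatter and so yields one field value, which is why emitting several fields of the same name from one value does not fit this interface directly.

```rust
use std::fmt::{self, Write};

/// Hypothetical list-valued header.
pub struct Accept(pub Vec<String>);

impl fmt::Display for Accept {
    /// Write all items as a single comma-separated field value.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for (i, item) in self.0.iter().enumerate() {
            if i > 0 {
                f.write_str(", ")?;
            }
            f.write_str(item)?;
        }
        Ok(())
    }
}

/// Write one `Name: value` line followed by CRLF.
pub fn write_field(out: &mut String, name: &str, value: &dyn fmt::Display) -> fmt::Result {
    write!(out, "{}: {}\r\n", name, value)
}
```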
Is jumping on the tails of std::fmt a good idea? (Bear in mind that std::fmt is UTF-8‐centric while HTTP is ISO-8859-1‐centric, though I do not really know if that influences things.)
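To show where the mismatch could bite, here is a small sketch of converting between the two encodings; the function names are my own. Each ISO-8859-1 byte corresponds to the Unicode code point of the same value, but UTF-8 encodes anything above U+007F as more than one byte, so the bytes of a Rust string are not ISO-8859-1 as soon as it leaves ASCII.

```rust
/// Decode ISO-8859-1: each byte maps to the code point of the same value.
pub fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Encode as ISO-8859-1, or `None` if any character is above U+00FF.
pub fn string_to_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| if (c as u32) <= 0xFF { Some(c as u8) } else { None })
        .collect()
}
```

For example, “café” is four bytes in ISO-8859-1 but five in UTF-8, and a character such as “€” has no ISO-8859-1 representation at all.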
At present, the header collections for requests and responses have different fields; should the header markers use phantom types to specify that they are only legal for requests or responses (or both)? That would make it so that request.headers.set(REFERER, …) would work while response.headers.set(REFERER, …) would fail to compile.
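One way this could be expressed is sketched below. Everything here is an assumption for the sake of illustration: the Request and Response kinds, the HeaderMarker and AllowedIn traits, and a collection that stores plain strings rather than typed values.

```rust
use std::marker::PhantomData;

/// Message kinds, used only at the type level.
pub struct Request;
pub struct Response;

pub trait HeaderMarker {
    const NAME: &'static str;
}

/// Implemented once for each message kind a marker may be used with.
pub trait AllowedIn<Kind>: HeaderMarker {}

pub struct Referer;
pub const REFERER: Referer = Referer;
impl HeaderMarker for Referer { const NAME: &'static str = "referer"; }
impl AllowedIn<Request> for Referer {} // requests only

pub struct ContentLength;
pub const CONTENT_LENGTH: ContentLength = ContentLength;
impl HeaderMarker for ContentLength { const NAME: &'static str = "content-length"; }
impl AllowedIn<Request> for ContentLength {}
impl AllowedIn<Response> for ContentLength {}

pub struct Headers<Kind> {
    fields: Vec<(&'static str, String)>,
    _kind: PhantomData<Kind>,
}

impl<Kind> Headers<Kind> {
    pub fn new() -> Self {
        Headers { fields: Vec::new(), _kind: PhantomData }
    }

    /// Only markers that are `AllowedIn<Kind>` are accepted.
    pub fn set<M: AllowedIn<Kind>>(&mut self, _marker: M, value: &str) {
        self.fields.retain(|(name, _)| *name != M::NAME);
        self.fields.push((M::NAME, value.to_string()));
    }

    pub fn get<M: AllowedIn<Kind>>(&self, _marker: M) -> Option<&str> {
        self.fields
            .iter()
            .find(|(name, _)| *name == M::NAME)
            .map(|(_, value)| value.as_str())
    }
}
```

With these definitions, calling set(REFERER, …) on a Headers<Request> compiles, while the same call on a Headers<Response> is rejected because Referer does not implement AllowedIn<Response>.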
I now invite you to join in the discussion at /r/rust.