Teepee design: header representation

This article is part of the series on the rust-http redesign, Teepee.

Header representation is a critical matter to Teepee’s design: it is uncompromisingly strongly typed, but there must be tradeoffs. After trying quite a few different schemes at length, I have settled upon quite a novel scheme which I believe to optimally balance all considerations.

The current situation

Here’s how rust-http currently handles headers:

struct HeaderCollection {
    // … plenty of strongly typed headers …
    connection: Option<Vec<Connection>>,
    content_type: Option<MediaType>,
    // … plenty more s.t. headers … and then at the end:
    extensions: HashMap<StrBuf, StrBuf>
}

… and then the header collection is placed into the request or response struct as the headers field. Thus the value of the Content-Length header of a response is response.headers.content_length and is of type Option<uint>. Note also that the request and response have different header collections, as there are various headers appropriate only to a request or response.

At present, headers are parsed proactively; valid headers are placed in a Some value and invalid or missing values are stored as None.

Criteria

There are a few criteria that I considered for header design in Teepee:

Performance must be “good”.
Ergonomics for using standardised headers must be “good”.
Ergonomics for using extension headers must be “good”.
Extension headers should be strongly typed.
Badly formatted headers must be accessible in some way.
A shared access technique for both standardised and extension headers is desirable (for backwards compatibility as more headers are added to the Teepee libraries), especially if it allows headers to be used as multiple types.

A sample of schemes tried

Bad header field value handling

For various reasons, it is necessary that invalid headers be able to be accessed, though normally an illegal value should be dropped. This was not done at all in rust-http, but is essential in Teepee. (For more fun, there are certain things to be aware of in parsing headers like that there may be multiple fields with the same name, but only if they would be equivalent to a single comma-separated list, with the difference that any illegal items should be dropped, or that in some cases part of a field may be invalid and be dropped while the rest is interpreted.)

Use a Option<Result<T, Vec<u8>>> (or similar) type instead of Option<T>. Rejected outright for (a) bad ergonomics, for almost no one can do anything with malformed values, so normal code must not be burdened with it; and (b) because partially bad headers would be lost.
Expose the unparsed values through callbacks that can be injected, or a conditions‐like scheme (minus the fail‐by‐default behaviour). Rejected because while the raw values can be accessed, doing so will be excessively difficult, leading to complaints when people do actually want it, as the layer or framework they are using may not have exposed it in an accessible way.
Store raw headers alongside the main place; for example, given request.headers, have request.raw_headers or request.headers.raw as a HashMap<Vec<u8>, Vec<Vec<u8>>> (that is, a mapping of field names to the values for that field) or a Vec<(Vec<u8>, Vec<u8>)> (that is, all of the header fields, completely raw). (Try saying “vec” ten times rapidly.) This isn’t ideal for the overhead, as it requires two heap allocations for each header field (the lowest level will still operate without allocations there; this is purely a feature of the high level) and must still store each header in raw form.

Of the options shown here, the last is probably the best. I have one further objection to it, though: it makes it too easy to access those raw values. I’m afraid that if I put that there, people might use it (!), and I don’t want that. (Yes, I’m stubborn about all this; I want to give people what’s good for them, not necessarily what they want. We’ll find out if I’m right in the years to come.) This can be remedied by having the raw fields private and only exposed through unsafe functions.

There is one other scheme, building upon the concepts of the third option, where the handling of bad header fields falls quite naturally out of the solution for header storage; I omit that here is it will be expanded upon later.

Header storage

I shall not bore you with the full details, though I have plenty more details should you be interested; here is a smattering of some of the things I’ve considered:

Fields for standard headers, a Map<StrBuf, StrBuf> for the rest: (a.k.a. “leave it as it is”) great performance, great ergonomics for standard headers, but extension headers are very much second class citizens, being weakly typed.
Fields for known headers, drop (or Map) unknown headers, multiple inheritance edition: very cool and perfect in almost every way… except the wait for a feature that will probably never be implemented.
Fields for known headers, drop (or Map) unknown headers, say‐what‐you‐want edition: (require people to declare the headers they expect to deal with and their types, producing a struct for each case). Great performance, poor ergonomics of declaration (especially for its unfamiliarity), useless for code reuse.
The other thing I’m about to talk about: good enough performance, good ergonomics for all headers, treats all headers alike; flawless with regards to backwards compatibility; allows read and write access to raw headers where necessary.
A combination of fields for standard headers with that thing I’m about to talk about for the rest: gets better performance for standard headers, but at the cost of consistency, future‐proofing and raw access to standard headers. (In short, a cute idea but not worth it.)

The scheme I have settled on

This scheme is not perfect, but weighing up the advantages and disadvantages I am fairly confident that it is fairly optimal for Rust.

Here are some examples of the API we end up with. We get nice and easy typed access:

headers.get(CONNECTION); // -> Option<Vec<Connection>>
headers.get_ref(LOCATION); // -> Option<&url::Url>
headers.get_mut_ref(DATE); // -> Option<&mut time::Tm>
headers.set(LOCATION, url /* url::Url */);

(get() necessarily takes &mut self, so you can’t borrow more than one header at a time—that’s a large part of why get is there, which clones the object before returning it.)

It would be possible to expose raw values via unsafe functions…

unsafe {
    headers.get_raw("connection"); // -> Option<&[Vec<u8>]>
    headers.set_raw("connection",
                    vec!(Vec::from_slice(bytes!("close"))));
}

But that’s actually not necessary! We can use get and set with a different argument and, from that, a different return type:

headers.get(raw("connection")); // -> Option<Vec<Vec<u8>>>
headers.set(raw("connection"),
            vec!(Vec::from_slice(bytes!("close"))));

(This will be slower than exposing raw values directly, as it is treating Vec<Vec<u8>> as a strong header type and so incurring an extra conversion; therefore in practice extra unsafe methods will probably be provided for raw access. But it demonstrates the principle of the thing.)

And the good part—it’s in there, so now we can access it as another strong type:

assert_eq!(headers.get(CONNECTION), Some(vec!(Close)));

How is this done? Internally it maintains a hash map of values which may be either raw or Box<Header> (a trait object); initially everything is raw, but as you access them they are parsed into the appropriate header type and stored; if you try to access a value as a different type, it is converted to the HTTP wire format and then parsed as the new type, and stored if that succeeded.

All this is pretty magic, and something that you couldn’t easily achieve in most languages, especially with regards to the type inference—but it’s something that can be achieved in Rust in a very convenient and completely safe way! (OK, so it takes just a smidgeon of unsafe code, implementing the Any*Ext traits—just a straight copy of the implementations for &Any et al. But the interface it exposes is absolutely solid and crash‐proof.)

A proof‐of‐concept implementation

I could yammer on and on about how it works, but I don’t think it’s necessary. Here’s my proof of concept in all its half‐baked glory:


/// The data type of an HTTP header for encoding and decoding.
pub trait Header: Any {
    fn parse_header(s: &[Vec<u8>]) -> Option<Self>;

    /// Introducing an `Err` value that does *not* come from the writer is incorrect behaviour and
    /// may lead to task failure in certain situations. (The primary case where this will happen is
    /// accessing a cached Box<Header> object as a different type; then, it is shoved into a buffer
    /// through fmt_header and then back into the new type through parse_header. Should the
    /// fmt_header call have return an Err, it will fail.
    fn fmt_header(&self, f: &mut fmt::Formatter) -> fmt::Result;
}

// I'd prefer this as another trait with an implementation of Header for something satisfying that
// trait, but I can't do that because of those pesky "conflicting implementations" things.
macro_rules! require_single_field {
    ($field_values:expr) => ({
        let mut iter = $field_values.iter();
        match (iter.next(), iter.next()) {
            (Some(ref field_value), None) => field_value.as_slice(),
            _ => return None,
        }
    })
}

// impl copied from std::any. Not especially nice, sorry :-(
impl<'a> AnyRefExt<'a> for &'a Header {
    #[inline]
    fn is<T: 'static>(self) -> bool {
        use std::intrinsics::TypeId;
        // Get TypeId of the type this function is instantiated with
        let t = TypeId::of::<T>();

        // Get TypeId of the type in the trait object
        let boxed = self.get_type_id();

        // Compare both TypeIds on equality
        t == boxed
    }

    #[inline]
    fn as_ref<T: 'static>(self) -> Option<&'a T> {
        if self.is::<T>() {
            Some(unsafe { self.as_ref_unchecked() })
        } else {
            None
        }
    }
}

trait UncheckedAnyRefExt<'a> {
    /// Returns a reference to the boxed value, assuming that it is of type `T`. This should only be
    /// called if you are ABSOLUTELY CERTAIN of `T` as you will get really wacky output if it’s not.
    unsafe fn as_ref_unchecked<T: 'static>(self) -> &'a T;
}

// Actually, I want to move this upstream. I have a patch ready which adds as_ref_unchecked,
// as_mut_unchecked and move_unchecked. What do you reckon? Opinions solicited. Of course, because
// this is Header rather than Any, I can't actually *use* such a thing myself if it's on Any...
impl<'a> UncheckedAnyRefExt<'a> for &'a Header {
    #[inline]
    unsafe fn as_ref_unchecked<T: 'static>(self) -> &'a T {
        use std::raw::TraitObject;
        use std::cast::{transmute, transmute_copy};

        // Get the raw representation of the trait object
        let to: TraitObject = transmute_copy(&self);

        // Extract the data pointer
        transmute(to.data)
    }
}

trait UncheckedAnyMutRefExt<'a> {
    /// Returns a reference to the boxed value, assuming that it is of type `T`. This should only be
    /// called if you are ABSOLUTELY CERTAIN of `T` as you will get really wacky output if it’s not.
    unsafe fn as_mut_unchecked<T: 'static>(self) -> &'a mut T;
}

impl<'a> UncheckedAnyMutRefExt<'a> for &'a mut Header {
    #[inline]
    unsafe fn as_mut_unchecked<T: 'static>(self) -> &'a mut T {
        use std::raw::TraitObject;
        use std::cast::{transmute, transmute_copy};

        // Get the raw representation of the trait object
        let to: TraitObject = transmute_copy(&self);

        // Extract the data pointer
        transmute(to.data)
    }
}

/// Standard usage of this is a marker type, like this:
///
/// ```rust
/// // The header data type
/// pub struct Foo {
///     ...
/// }
///
/// impl Header for Foo {
///     ...
/// }
///
/// // The marker type for accessing the header and specifying the name
/// pub struct FOO;
///
/// impl HeaderMarker<Foo> for FOO {
///     fn header_name(&self) -> SendStr {
///         Slice("foo")
///     }
/// }
/// ```
///
/// Then, accessing the header is done like this:
///
/// ```rust
/// let foo = request.headers.get(FOO);
/// request.headers.set(FOO, foo);
/// ```
///
/// And lo! `foo` is a `Foo` object corresponding to the `foo` header in the request.
///
/// Authors are strongly advised that they should not implement `HeaderMarker<T>` for more than one
/// `T` on the same type.
pub trait HeaderMarker<OutputType: Header + 'static> {
    /// The name of the header that shall be used for retreiving and setting.
    ///
    /// Normally this will be a static string, but occasionally it may be necessary to define it at
    /// runtime, for dynamic header handling.
    fn header_name(&self) -> SendStr;
}

/// The data type for the ``expires`` header.
#[deriving(Clone, Eq, Show)]
pub enum Expires {
    /// The Expires header had an invalid format, which MUST be interpreted as “in the past”.
    Past,
    /// A valid Expires header date.
    ExpiresDate(Tm),
}

impl Header for Expires {
    fn parse_header(raw: &[Vec<u8>]) -> Option<Expires> {
        require_single_field!(raw);
        match Header::parse_header(raw) {
            Some(tm) => Some(ExpiresDate(tm)),
            None => Some(Past),
        }
    }

    fn fmt_header(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match *self {
            Past => write!(f.buf, "0"),
            ExpiresDate(ref tm) => tm.fmt_header(f),
        }
    }
}

impl Header for Tm {
    fn parse_header(raw: &[Vec<u8>]) -> Option<Tm> {
        let raw = require_single_field!(raw);
        let raw = match std::str::from_utf8(raw) {
            Some(raw) => raw,
            None => return None,
        };
        // XXX: %Z actually ignores any timezone other than UTC. Probably not a good idea?
        match strptime(raw, "%a, %d %b %Y %T %Z") {  // RFC 822, updated by RFC 1123
            Ok(time) => return Some(time),
            Err(_) => ()
        }

        match strptime(raw, "%A, %d-%b-%y %T %Z") {  // RFC 850, obsoleted by RFC 1036
            Ok(time) => return Some(time),
            Err(_) => ()
        }

        match strptime(raw, "%c") {  // ANSI C's asctime() format
            Ok(time) => Some(time),
            Err(_) => None,
        }
    }

    fn fmt_header(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f.buf, "{}", self.to_utc().strftime("%a, %d %b %Y %T GMT"))
    }
}

impl Header for Box<Header> {
    fn parse_header(_raw: &[Vec<u8>]) -> Option<Box<Header>> {
        // Dummy impl; XXX: split to ToHeader/FromHeader?
        None
    }

    fn fmt_header(&self, f: &mut fmt::Formatter) -> fmt::Result {
        self.fmt_header(f)
    }
}

impl<'a> Header for &'a Header {
    fn parse_header(_raw: &[Vec<u8>]) -> Option<&'a Header> {
        // Dummy impl; XXX: split to ToHeader/FromHeader?
        None
    }

    fn fmt_header(&self, f: &mut fmt::Formatter) -> fmt::Result {
        self.fmt_header(f)
    }
}

macro_rules! impl_header_from_fromstr {
    ($ty:ty) => (impl Header for $ty {
        fn parse_header(raw: &[Vec<u8>]) -> Option<$ty> {
            std::str::from_utf8(raw).and_then(from_str)
        }

        fn fmt_header(&self, f: &mut fmt::Formatter) -> fmt::Result {
            write!(f.buf, "{}", self)
        }
    })
}

macro_rules! header {
    ($struct_ident:ident, $header_name:expr, $output_type:ty) => (
        pub struct $struct_ident;

        impl HeaderMarker<$output_type> for $struct_ident {
            fn header_name(&self) -> SendStr {
                Slice($header_name)
            }
        }
    )
}

header!(EXPIRES, "expires", Expires)
header!(DATE, "date", Tm)

pub struct RawHeaderMarker {
    pub name: SendStr,
}

/// This grants you access to raw header values:
///
/// ```rust
/// # headers = Headers::new();
/// headers.set(raw("expires"), "bar".as_bytes().iter().map(|&x| x).collect());
/// assert_eq!(headers.get(raw("expires")).to_slice(), bytes!("bar"));
/// ```
///
/// That is, the header collection values are interpreted as `Vec<u8>`.
///
/// You should largely avoid doing this.
pub fn raw<S: IntoMaybeOwned<'static>>(s: S) -> RawHeaderMarker {
    RawHeaderMarker {
        name: s.into_maybe_owned()
    }
}

impl HeaderMarker<Vec<Vec<u8>>> for RawHeaderMarker {
    fn header_name(&self) -> SendStr {
        self.name.clone()
    }
}

impl Header for Vec<Vec<u8>> {
    fn parse_header(raw: &[Vec<u8>]) -> Option<Vec<Vec<u8>>> {
        Some(raw.iter().map(|x| x.clone()).collect())
    }

    fn fmt_header(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let mut first = true;
        for field in self.iter() {
            try!(f.buf.write(field.as_slice()));
            if !first {
                try!(f.buf.write([',' as u8, ' ' as u8]));
            }
            first = false;
        }
        Ok(())
    }
}

enum Item {
    /// A raw, unparsed header. Uses the ISO-8859-1 character set, but could contain things in other
    /// character sets according to the rules of RFC 2047, e.g. in a *TEXT rule (RFC 2616 grammar).
    Raw(Vec<Vec<u8>>),

    /// A strongly typed header which has been parsed from the raw value.
    Typed(Box<Header>:'static),
}

/// A collection of HTTP headers.
struct Headers {
    data: HashMap<SendStr, Item>,
}

impl Headers {
    fn new() -> Headers {
        Headers {
            data: HashMap::new(),
        }
    }
}

/* I think I'll skip the Index implementation, though it works OK.
impl<H: Header + Clone + 'static, M: HeaderMarker<H>> Index<M, Option<H>> for Headers {
    fn index(&self, header_marker: &M) -> Option<H> {
        let mut_self = unsafe { std::cast::transmute_mut(self) };
        match mut_self.mostly_get(header_marker) {
            Some(&Typed(ref h)) => Some(unsafe { h.as_ref_unchecked::<H>() }.clone()),
            _ => None,
        }
    }
}
*/

impl<H: Header + Clone + 'static, M: HeaderMarker<H>> Headers {
    pub fn get<'a>(&'a mut self, header_marker: M) -> Option<H> {
        // At this point, we know that item is None, or Some(&mut Typed(h)) and that h.is::<H>().
        // On that basis, we can use as_ref_unchecked instead of as_ref, to save a virtual call.
        match self.mostly_get(&header_marker) {
            Some(&Typed(ref h)) => Some(unsafe { h.as_ref_unchecked::<H>() }.clone()),
            _ => None,
        }
    }
}

impl<H: Header + 'static, M: HeaderMarker<H>> Headers {
    fn mostly_get<'a>(&'a mut self, header_marker: &M) -> Option<&'a mut Item> {
        let name = header_marker.header_name();
        let item = match self.data.find_mut(&name) {
            // Yes, there's a header… or something… by that name
            Some(v) => v,
            // Header? There is no header!
            None => return None,
        };

        let (insert_parsed, parsed): (bool, Option<H>) = match *item {
            // We've parsed this header before, as some type or other.
            // Question is: is it the right type?

            // Yes, it's the right type, so we can immediately return it.
            // Well, we could, except that that makes the borrow checker keep the item borrow alive,
            // preventing the insert in the other cases. So we must refactor it to return at the
            // end instead.
            Typed(ref h) if h.is::<H>() => {
                (false, None)
            },

            // No, it was parsed as a different type.
            // Very well then, we will turn it back into a string first.
            Typed(ref h) => {
                let raw = fmt_header(h);
                (true, Header::parse_header([raw]))
            },

            // We haven't parsed it before, so let's have a go at that.
            Raw(ref raw) => (true, Header::parse_header(raw.as_slice())),
        };

        if insert_parsed {
            match parsed {
                // It parsed. Let's store the new value (replacing the raw one)
                Some(v) => *item = Typed(box v),
                // If the header doesn't parse, that's the same as it being absent.
                None => return None,
            }
        }
        Some(item)
    }

    pub fn get_ref<'a>(&'a mut self, header_marker: M) -> Option<&'a H> {
        match self.mostly_get(&header_marker) {
            Some(&Typed(ref h)) => Some(unsafe { h.as_ref_unchecked::<H>() }),
            _ => None,
        }
    }

    pub fn get_mut_ref<'a>(&'a mut self, header_marker: M) -> Option<&'a mut H> {
        match self.mostly_get(&header_marker) {
            Some(&Typed(ref mut h)) => Some(unsafe { h.as_mut_unchecked::<H>() }),
            _ => None,
        }
    }

    /// Set the named header to the given value.
    pub fn set(&mut self, header_marker: M, value: H) {
        self.data.insert(header_marker.header_name(), Typed(box value));
    }

    /// Remove a header from the collection.
    /// Returns true if the named header was present.
    pub fn remove(&mut self, header_marker: &M) {
        self.data.remove(&header_marker.header_name());
    }
}

/// An adapter which provides `std::fmt::Show` as equivalent to `Header.fmt_header`, so that you can
/// actually *use* the thing.
struct HeaderShowAdapter<'a, H>(pub &'a H);

impl<'a, H: Header> fmt::Show for HeaderShowAdapter<'a, H> {
    #[inline]
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let HeaderShowAdapter(h) = *self;
        h.fmt_header(f)
    }
}

#[inline]
pub fn fmt_header<H: Header>(h: &H) -> Vec<u8> {
    format_args!(format_but_not_utf8, "{}", HeaderShowAdapter(h))
}

// Parallel to ::std::fmt::{format, format_unsafe}, but returning Vec<u8> rather than Box<str>.
#[inline]
#[doc(hidden)]
pub fn format_but_not_utf8(args: &fmt::Arguments) -> Vec<u8> {
    let mut output = std::io::MemWriter::new();
    fmt::write(&mut output as &mut Writer, args).unwrap();
    output.unwrap()
}

fn main() {
    let mut headers = Headers::new();
    fn expect<H: Header + std::fmt::Show + Eq>(h: Option<H>, h_expected: H, raw: &[u8]) {
        let h = h.unwrap();
        assert_eq!(fmt_header(&h).as_slice(), raw);
        assert_eq!(h, h_expected);
    }
    fn expect_none<H: Header>(h: Option<H>) {
        assert!(h.is_none());
    }

    expect_none(headers.get(EXPIRES));
    headers.set(EXPIRES, Past);
    expect(headers.get(EXPIRES), Past, bytes!("0"));
    expect(headers.get(raw("expires")), vec!(vec!('0' as u8)), bytes!("0"));
    expect(headers.get(EXPIRES), Past, bytes!("0"));
    headers.remove(&EXPIRES);
    expect_none(headers.get(EXPIRES));

    expect_none(headers.get(DATE));
    let now = time::now();
    let now_raw = fmt_header(&now);
    headers.set(DATE, now.clone());
    expect(headers.get(DATE), now.clone(), now_raw.as_slice());
}

Assessment

Here’s how this goes on the criteria listed:

Performance must be “good”.
The good news: headers are parsed lazily. The bad news: two or three allocations per header field, plus an allocation and a deallocation for every type change (including that lazy parsing). Performance is, I think, the weakest part of this scheme, but I don’t believe anything faster is possible while still ticking all the other boxes.
Ergonomics for using standardised headers must be “good”.
There’s a certain perfection about field access… but this will do dandy.
Ergonomics for using extension headers must be “good”.
As above.
Extension headers should be strongly typed.
Got it.
Badly formatted headers must be accessible in some way.
Readily accessible, though accessing a header as a strongly typed value will destroy the raw stuff, so anyone caring about such things will need to be careful.
A shared access technique for both standardised and extension headers is desirable (for backwards compatibility as more headers are added to the Teepee libraries), especially if it allows headers to be used as multiple types.
This solution absolutely shines at this: if two libraries want to access the same header as a different type, it will happily convert between them, just sacrificing a little bit of performance for the benefit.

Unless anyone finds any significant problems with this scheme, or a better scheme, Teepee will be using this type of header collection.

Remaining questions

Should access to the raw values be granted via unsafe functions, or should typed access be genuinely the only interface exposed, leaving an implementation for Vec<Vec<u8>> as the way of accessing raw values? (Side note: without that, the library will accept multiple header fields with the same name, but will merge them separated by a comma when writing them. This is perfectly spec‐compliant behaviour, but it is conceivable that someone may wish to genuinely write multiple header fields with the same name and not merge them. I guess I probably just talked myself into answering this question. Still, without such behaviour, the header collections for writing could skip the possibility of a raw value and just use a Box<Header> directly instead of Item, which would produce moderately trivial performance improvements. So I guess the quesiton is still open. Can anyone think of a way that one could use the std::fmt interface and yet still allow the writing of multiple header fields from the one value?)
Is jumping on the tails of std::fmt a good idea? (Bear in mind that std::fmt is UTF-8‐centric while HTTP is ISO-8859-1‐centric, though I do not really know if that influences thing.)
At present, the header collections for requests and responses have different fields; should the header markers use phantom types to specify that they are only legal for requests or responses (or both)? That would make it so that request.headers.set(REFERER, …) would work while response.headers.set(REFERER, …) would fail to compile.

I now invite you to join in the discussion at /r/rust.