Parsing HTML page content in a stream with hyper and html5ever

Sorry for the lack of tutorial-like documentation for html5ever and tendril…

Unless you’re 100% sure your content is in UTF-8, use from_bytes rather than from_utf8. Both return a value that implements TendrilSink, which lets you provide the input incrementally or all at once.

The std::io::Read::read_to_end method takes a &mut Vec<u8>, so it doesn’t work with TendrilSink.

At the lowest level, you can call the TendrilSink::process method once per &[u8] chunk, and then call TendrilSink::finish.

To avoid doing that manually, there’s also the TendrilSink::read_from method that takes &mut R where R: std::io::Read. Since hyper::client::Response implements Read, you can use:

parse_document(RcDom::default(), Default::default()).from_bytes().read_from(&mut res)
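To make the relationship between process, finish and read_from concrete, here is a minimal std-only sketch of the pattern. The ChunkSink trait and TagOpenCounter type are hypothetical stand-ins invented for illustration; they are not tendril's real TendrilSink (whose process method takes a tendril rather than a plain slice), and the 4096-byte buffer size is an arbitrary assumption.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for an incremental sink (NOT tendril's real trait).
trait ChunkSink {
    type Output;

    /// Consume one chunk of input.
    fn process(&mut self, chunk: &[u8]);

    /// Signal the end of input and return the final result.
    fn finish(self) -> Self::Output;

    /// Read `reader` to the end, forwarding each chunk to `process`,
    /// then call `finish`. This is the loop you would otherwise write by hand.
    fn read_from<R: Read>(mut self, reader: &mut R) -> io::Result<Self::Output>
    where
        Self: Sized,
    {
        let mut buf = [0u8; 4096]; // arbitrary chunk size
        loop {
            match reader.read(&mut buf) {
                Ok(0) => return Ok(self.finish()), // end of input
                Ok(n) => self.process(&buf[..n]),
                Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
            }
        }
    }
}

/// Toy sink: counts `<` bytes, so chunk boundaries cannot affect the result.
#[derive(Default)]
struct TagOpenCounter {
    count: usize,
}

impl ChunkSink for TagOpenCounter {
    type Output = usize;

    fn process(&mut self, chunk: &[u8]) {
        self.count += chunk.iter().filter(|&&b| b == b'<').count();
    }

    fn finish(self) -> usize {
        self.count
    }
}
```

Anything that implements Read can be passed to read_from here, including a byte slice or an HTTP response body. The parser never needs the whole document in memory at once.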

To go beyond your question, RcDom is very minimal and mostly exists in order to test html5ever. I recommend using Kuchiki instead. It has more features (tree traversal, CSS selector matching, …), including optional hyper support.

In your Cargo.toml:

[dependencies]
kuchiki = {version = "0.3.1", features = ["hyper"]}

In your code:

let document = kuchiki::parse_html().from_http(res).unwrap();

Unless I'm misunderstanding something, processing the HTML tokens is quite involved (and the names of the atom constants are unfortunately very far from perfect). This code demonstrates how to use html5ever version 0.25.1 to process the tokens.

First, we want a String with the HTML body:

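let body = {
    // `read_to_string` comes from the `std::io::Read` trait.
    use std::io::Read;

    let mut body = String::new();
    let client = Client::new();

    client.post(WEBPAGE)
        .header(ContentType::form_url_encoded())
        .body(BODY)
        .send()?
        // Propagate read errors instead of silently dropping the `Result`.
        .read_to_string(&mut body)?;

    body
};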

Second, we need to define our own Sink, which contains the "callbacks" and lets you hold any state you need. For this example, I will detect <a> tags and print them back as HTML. This requires detecting a start tag, an end tag, and text, as well as finding an attribute, so it is hopefully a complete-enough example:

use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, Tag, TagKind, Token, TokenSink, TokenSinkResult, Tokenizer,
};
use html5ever::{ATOM_LOCALNAME__61 as TAG_A, ATOM_LOCALNAME__68_72_65_66 as ATTR_HREF};

// Define your own `TokenSink`. This is where you keep state and where your "callbacks" run.
struct Sink {
    text: Option<String>,
}

impl TokenSink for Sink {
    type Handle = ();

    fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
        match token {
            Token::TagToken(Tag {
                kind: TagKind::StartTag,
                name,
                self_closing: _,
                attrs,
            }) => match name {
                // Check tag name, attributes, and act.
                TAG_A => {
                    let url = attrs
                        .into_iter()
                        .find(|a| a.name.local == ATTR_HREF)
                        .map(|a| a.value.to_string())
                        .unwrap_or_else(|| "".to_string());

                    print!("<a href=\"{}\">", url);
                    self.text = Some(String::new());
                }
                _ => {}
            },
            Token::TagToken(Tag {
                kind: TagKind::EndTag,
                name,
                self_closing: _,
                attrs: _,
            }) => match name {
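                TAG_A => {
                    // Ignore a stray `</a>` that has no matching start tag
                    // instead of panicking on `unwrap()`.
                    if let Some(text) = self.text.take() {
                        println!("{}</a>", text);
                    }
                }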
                _ => {}
            },
            Token::CharacterTokens(string) => {
                if let Some(text) = self.text.as_mut() {
                    text.push_str(&string);
                }
            }
            _ => {}
        }
        TokenSinkResult::Continue
    }
}


let sink = {
    let sink = Sink {
        text: None,
    };

    // Now, feed the HTML `body` string to the tokenizer.
    // This requires a bit of setup (buffer queue, tendrils, etc.).
    let mut input = BufferQueue::new();
    input.push_back(StrTendril::from_slice(&body).try_reinterpret().unwrap());
    let mut tok = Tokenizer::new(sink, Default::default());
    let _ = tok.feed(&mut input);
    tok.end();
    tok.sink
};

// `sink` is your `Sink` after all processing was done.
assert!(sink.text.is_none());
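
If you would rather collect the links than print them, the same state-keeping idea can be sketched in isolation from html5ever. Everything below is a hypothetical simplification using only the standard library: SimpleToken, Link and LinkCollector are names invented for this sketch and are not html5ever types. In real code, the match arms would go in your TokenSink::process_token implementation.

```rust
/// Hypothetical, simplified token type (NOT html5ever's `Token`).
enum SimpleToken {
    Start { name: String, attrs: Vec<(String, String)> },
    End { name: String },
    Text(String),
}

/// A finished link: its `href` and the text found inside the element.
#[derive(Debug, PartialEq)]
struct Link {
    href: String,
    text: String,
}

#[derive(Default)]
struct LinkCollector {
    /// `Some((href, text_so_far))` while inside an `<a>` element.
    current: Option<(String, String)>,
    /// Links completed so far, in document order.
    links: Vec<Link>,
}

impl LinkCollector {
    fn process_token(&mut self, token: SimpleToken) {
        match token {
            SimpleToken::Start { name, attrs } if name == "a" => {
                // A missing `href` becomes an empty string, as in the code above.
                let href = attrs
                    .into_iter()
                    .find(|(key, _)| key.as_str() == "href")
                    .map(|(_, value)| value)
                    .unwrap_or_default();
                self.current = Some((href, String::new()));
            }
            SimpleToken::End { name } if name == "a" => {
                // `take()` leaves `None` behind; a stray `</a>` is ignored.
                if let Some((href, text)) = self.current.take() {
                    self.links.push(Link { href, text });
                }
            }
            SimpleToken::Text(s) => {
                // Only accumulate text while inside an `<a>` element.
                if let Some((_, text)) = self.current.as_mut() {
                    text.push_str(&s);
                }
            }
            // Other start and end tags do not change the state.
            _ => {}
        }
    }
}
```

Keeping the pending link in an Option makes the "am I inside an <a>?" question explicit. It also makes a stray end tag harmless, because there is nothing to take.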