A Rust language project for parsing, filtering, selecting and serializing HTML and XML mark-up.
See the marked and marked-cli crates, or the README(s) and CHANGELOG(s) under this (GitHub hosted) source tree and cargo workspace.
Currently implemented features:
The marked::Document is a DOM-like tree structure suitable for HTML and
XML. This was forked from the victor project (same author as html5ever)
and further optimized. It is implemented as a (std) Vec of Node types,
which reference their parent, siblings and children via (std) NonZeroU32
indexes for space efficiency.
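To illustrate the layout described above, here is a minimal, std-only sketch of a tree stored in a Vec and linked by NonZeroU32 indexes. It is not the actual marked::Document implementation: the `Tree`, `Node` and `NodeId` names, the field set, and the reserved padding slot at index 0 are all assumptions made for this example.

```rust
use std::num::NonZeroU32;

/// Hypothetical index type. Because zero is excluded, `Option<NodeId>`
/// occupies the same 4 bytes as a plain u32.
type NodeId = NonZeroU32;

struct Node {
    name: String,
    parent: Option<NodeId>,
    next_sibling: Option<NodeId>,
    first_child: Option<NodeId>,
    last_child: Option<NodeId>,
}

impl Node {
    fn new(name: &str) -> Node {
        Node {
            name: name.to_string(),
            parent: None,
            next_sibling: None,
            first_child: None,
            last_child: None,
        }
    }
}

struct Tree {
    /// Slot 0 holds an unused padding node, so real ids are never zero.
    nodes: Vec<Node>,
}

impl Tree {
    /// Create a tree with a single root node and return the root's id.
    fn new(root_name: &str) -> (Tree, NodeId) {
        let mut tree = Tree { nodes: vec![Node::new("<padding>")] };
        let root = tree.push(Node::new(root_name));
        (tree, root)
    }

    fn push(&mut self, node: Node) -> NodeId {
        let idx = self.nodes.len() as u32;
        self.nodes.push(node);
        NonZeroU32::new(idx).expect("index 0 is reserved")
    }

    fn get(&self, id: NodeId) -> &Node {
        &self.nodes[id.get() as usize]
    }

    /// Append a new node as the last child of `parent`.
    fn append_child(&mut self, parent: NodeId, name: &str) -> NodeId {
        let mut node = Node::new(name);
        node.parent = Some(parent);
        let id = self.push(node);
        let last = self.nodes[parent.get() as usize].last_child;
        match last {
            Some(prev) => self.nodes[prev.get() as usize].next_sibling = Some(id),
            None => self.nodes[parent.get() as usize].first_child = Some(id),
        }
        self.nodes[parent.get() as usize].last_child = Some(id);
        id
    }

    /// Collect the ids of the direct children of `parent`, in order.
    fn children(&self, parent: NodeId) -> Vec<NodeId> {
        let mut out = Vec::new();
        let mut next = self.get(parent).first_child;
        while let Some(id) = next {
            out.push(id);
            next = self.get(id).next_sibling;
        }
        out
    }
}
```

With this layout, every link is a 4-byte optional index into one contiguous allocation rather than a reference-counted pointer, and mutation only requires a `&mut Tree`.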
HTML5 document and fragment parsing and HTML5 serialization (mark-up
output) are included. With the marked::Document (DOM), parsing and
serialization are measurably faster (see benchmarks in source tree) than
with the RcDom previously included with html5ever associated crates, and
mutating the Document is more straightforward, via a mutable reference.
Strict, UTF-8 XML parsing to marked::Document is currently supported by
integration of the xml-rs crate.
An estimated 5% of the web remains in encodings other than UTF-8, which is too
common to be treated as an error. marked::html::parse_buffered provides:
-
Decoding via encoding_rs, which implements The Encoding Standard, including alternative names (labels) for supported encodings.
-
HTML5 parsing restart from the initial (4k) buffer with new encoding hints obtained from a <head>/<meta> charset or an http-equiv content-type with charset.
-
Byte-Order-Mark (BOM) sniffing as a high priority EncodingHint for UTF-8, UTF-16 Big-Endian and UTF-16 Little-Endian.
-
"Impossible" hints from the above are ignored. For example, a hint read while decoding as UTF-8 that claims the document is UTF-16LE is ignored, since that hint could not have been read if the document really were UTF-16LE.
(Note that the detection features are not currently provided by html5ever and associated crates.)
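The BOM and "impossible hint" rules above can be sketched with std only. This is an illustration under stated assumptions, not the marked or encoding_rs API: the `Enc` enum, `sniff_bom`, `is_impossible_hint` and `choose` are hypothetical names, and the impossibility test shown (an ASCII-compatible decode yielding a UTF-16 hint) is one plausible reading of the rule.

```rust
/// Hypothetical, reduced set of encodings for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Enc {
    Utf8,
    Utf16Be,
    Utf16Le,
    Windows1252,
}

/// Return the encoding named by a leading Byte-Order-Mark, plus the
/// BOM length in bytes, or None if the buffer has no recognized BOM.
fn sniff_bom(buf: &[u8]) -> Option<(Enc, usize)> {
    if buf.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((Enc::Utf8, 3))
    } else if buf.starts_with(&[0xFE, 0xFF]) {
        Some((Enc::Utf16Be, 2))
    } else if buf.starts_with(&[0xFF, 0xFE]) {
        Some((Enc::Utf16Le, 2))
    } else {
        None
    }
}

fn is_ascii_compatible(enc: Enc) -> bool {
    matches!(enc, Enc::Utf8 | Enc::Windows1252)
}

/// Assumption: a <meta> hint found while decoding with an
/// ASCII-compatible encoding cannot truthfully name UTF-16, because
/// the tag would not have been readable in that case.
fn is_impossible_hint(current: Enc, hinted: Enc) -> bool {
    is_ascii_compatible(current) && !is_ascii_compatible(hinted)
}

/// Pick an encoding: a BOM wins, then a possible <meta> hint, then
/// the default that was used for the initial read.
fn choose(buf: &[u8], default: Enc, meta_hint: Option<Enc>) -> Enc {
    if let Some((enc, _bom_len)) = sniff_bom(buf) {
        return enc;
    }
    match meta_hint {
        Some(hint) if !is_impossible_hint(default, hint) => hint,
        _ => default,
    }
}
```

In a real parser the chosen encoding differing from the one in use is what would trigger the restart from the initial buffer.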
A NodeRef type with "CSS selectors"-like methods to recursively select and
find elements using closure predicates. We prefer direct Rust language
compiler support for writing such selection logic, over CSS or another
interpreted DSL.
See marked::html::t (tags) and marked::html::a (attributes) modules.
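As a rough illustration of selection by closure predicate, the following std-only sketch walks a small owned tree. It does not use the marked NodeRef API: the `Element` type and its `select`, `find` and `attr` methods are hypothetical stand-ins, and this version collects matches eagerly into a Vec.

```rust
/// Hypothetical, simplified element type for this sketch.
struct Element {
    tag: &'static str,
    attrs: Vec<(&'static str, &'static str)>,
    children: Vec<Element>,
}

impl Element {
    fn new(
        tag: &'static str,
        attrs: Vec<(&'static str, &'static str)>,
        children: Vec<Element>,
    ) -> Element {
        Element { tag, attrs, children }
    }

    /// Look up an attribute value by name.
    fn attr(&self, name: &str) -> Option<&'static str> {
        self.attrs.iter().find(|(k, _)| *k == name).map(|(_, v)| *v)
    }

    /// Collect, in document order, every descendant for which
    /// `predicate` returns true.
    fn select<'a, P>(&'a self, predicate: &P, out: &mut Vec<&'a Element>)
    where
        P: Fn(&Element) -> bool,
    {
        for child in &self.children {
            if predicate(child) {
                out.push(child);
            }
            child.select(predicate, out);
        }
    }

    /// Return the first descendant, in document order, for which
    /// `predicate` returns true.
    fn find<P>(&self, predicate: &P) -> Option<&Element>
    where
        P: Fn(&Element) -> bool,
    {
        for child in &self.children {
            if predicate(child) {
                return Some(child);
            }
            if let Some(found) = child.find(predicate) {
                return Some(found);
            }
        }
        None
    }
}

/// Roughly the CSS selector `a[href]`, written as a plain function
/// that the compiler type-checks.
fn is_link(e: &Element) -> bool {
    e.tag == "a" && e.attr("href").is_some()
}
```

A call such as `doc.select(&is_link, &mut links)` then plays the role of a CSS query, while an inline closure like `|e: &Element| e.attr("class") == Some("lead")` covers one-off conditions.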
Bulk modifications to the DOM are easily and efficiently achieved with mutating
filter functions/closures and a tree walker (depth or breadth-first)
implementation in marked. This style of interface is sometimes called the
"visitor pattern". See Document::filter_at for details. The crate also
includes the following built-in filters (a partial list):
detach_banned_element
: Detach known banned (via metadata) and unknown elements
retain_basic_attributes
: Remove all attributes that are not part of the "basic" logical set (via metadata)
fold_empty_inline
: Fold empty or meaninglessly "inline" elements
text_normalize
: Normalize text nodes by merging, replacing control characters and minimizing white-space.
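The filter-and-walker style described above can be sketched with std only. This is not the marked implementation or its signatures: the `Node` and `Action` types, `filter_children`, and the two filters `detach_script` and `fold_empty_span` are hypothetical, simplified stand-ins, and the walk shown visits children before their parent as an assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Elem { tag: String, children: Vec<Node> },
}

/// What a filter asks the walker to do with the node it was given.
enum Action {
    /// Keep the node (possibly mutated by the filter).
    Continue,
    /// Remove the node and all of its descendants.
    Detach,
    /// Replace the node with its children.
    Fold,
}

/// Depth-first walk: descendants are filtered before the node itself.
fn filter_children<F>(children: Vec<Node>, f: &mut F) -> Vec<Node>
where
    F: FnMut(&mut Node) -> Action,
{
    let mut out = Vec::new();
    for mut node in children {
        if let Node::Elem { children, .. } = &mut node {
            let kids = std::mem::take(children);
            *children = filter_children(kids, f);
        }
        match f(&mut node) {
            Action::Continue => out.push(node),
            Action::Detach => {}
            Action::Fold => match node {
                Node::Elem { children, .. } => out.extend(children),
                // A text node has no children to fold into; keep it.
                text => out.push(text),
            },
        }
    }
    out
}

/// Hypothetical filter: drop <script> elements entirely.
fn detach_script(node: &mut Node) -> Action {
    match node {
        Node::Elem { tag, .. } if tag.as_str() == "script" => Action::Detach,
        _ => Action::Continue,
    }
}

/// Hypothetical filter: fold away <span> elements with no children.
fn fold_empty_span(node: &mut Node) -> Action {
    match node {
        Node::Elem { tag, children }
            if tag.as_str() == "span" && children.is_empty() =>
        {
            Action::Fold
        }
        _ => Action::Continue,
    }
}
```

Each filter sees one node at a time through a mutable reference and returns a small instruction, so filters stay independent of the traversal and can be applied one after another, for example `filter_children(filter_children(doc, &mut detach_script), &mut fold_empty_span)`.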
An unreleased example, compatibility test and benchmark of filtering equivalent to the ammonia crate (for hygiene and safety) is included in the source tree (./ammonia-compare).
Incomplete or unstarted features which may be included in this project in the future (PRs welcome):
-
Complete (faster, more correct, legacy encodings) strict-mode XML parsing
-
Lenient-mode XML parsing
-
Optional (opt-in) direct charset detection (initial read buffer or entire document) via something like chardet, integrated as high priority EncodingHint.
-
XML/HTML pretty-indenting serialization (combines well with the existing white-space normalization features)
-
XML (and XHTML) serialization
This project is dual licensed under either of the following:
-
The Apache License, version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
-
The MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the märkəd project by you, as defined by the Apache License, shall be dual licensed as above, without any additional terms or conditions.