The Märkəd Project

A Rust language project for parsing, filtering, selecting and serializing HTML and XML mark-up.

See the marked and marked-cli crates, or the README(s) and CHANGELOG(s) under this (GitHub-hosted) source tree and cargo workspace.

Feature Overview

Currently implemented features:

A vector-allocated, indexed, DOM-like tree structure

The marked::Document is a DOM-like tree structure suitable for HTML and XML. It was forked from the victor project (same author as html5ever) and further optimized. It is implemented as a (std) Vec of Node types, which reference their parent, siblings and children via (std) NonZeroU32 indexes for space efficiency.
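
The layout described above can be sketched with the standard library alone. This is a minimal illustration, not the marked implementation: the names Tree, Node, NodeId and append_child are hypothetical, and reserving index 0 as an unused padding slot is an assumption made here so that every real index is non-zero. It shows why NonZeroU32 helps: an Option<NodeId> takes 4 bytes, the same as a bare u32.

```rust
use std::num::NonZeroU32;

/// Index of a node inside `Tree::nodes` (hypothetical name).
/// Index 0 is never handed out, so `Option<NodeId>` needs no extra tag byte.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(NonZeroU32);

#[derive(Debug)]
pub struct Node {
    pub name: String,
    pub parent: Option<NodeId>,
    pub first_child: Option<NodeId>,
    pub last_child: Option<NodeId>,
    pub next_sibling: Option<NodeId>,
}

impl Node {
    fn new(name: &str) -> Node {
        Node {
            name: name.to_string(),
            parent: None,
            first_child: None,
            last_child: None,
            next_sibling: None,
        }
    }
}

/// All nodes live in one Vec; links between them are plain indexes.
#[derive(Debug)]
pub struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    /// Create a tree with a padding node at index 0 (assumption of this
    /// sketch) and the root node at index 1.
    pub fn new(root_name: &str) -> Tree {
        let mut tree = Tree { nodes: Vec::new() };
        tree.nodes.push(Node::new("")); // padding slot, never linked
        tree.nodes.push(Node::new(root_name));
        tree
    }

    pub fn root(&self) -> NodeId {
        NodeId(NonZeroU32::new(1).expect("1 is non-zero"))
    }

    pub fn node(&self, id: NodeId) -> &Node {
        &self.nodes[id.0.get() as usize]
    }

    fn node_mut(&mut self, id: NodeId) -> &mut Node {
        &mut self.nodes[id.0.get() as usize]
    }

    /// Append a new node as the last child of `parent` and return its id.
    pub fn append_child(&mut self, parent: NodeId, name: &str) -> NodeId {
        let mut node = Node::new(name);
        node.parent = Some(parent);
        let index = self.nodes.len() as u32;
        self.nodes.push(node);
        let id = NodeId(NonZeroU32::new(index).expect("index 0 is the padding slot"));

        // Link the new node after the previous last child, if any.
        let previous_last = self.node(parent).last_child;
        match previous_last {
            Some(prev) => self.node_mut(prev).next_sibling = Some(id),
            None => self.node_mut(parent).first_child = Some(id),
        }
        self.node_mut(parent).last_child = Some(id);
        id
    }

    /// Collect the direct children of `id` by following sibling links.
    pub fn children(&self, id: NodeId) -> Vec<NodeId> {
        let mut out = Vec::new();
        let mut next = self.node(id).first_child;
        while let Some(child) = next {
            out.push(child);
            next = self.node(child).next_sibling;
        }
        out
    }
}
```

Because every link is an index rather than a reference-counted pointer, mutation only needs a single mutable borrow of the tree, and the nodes stay contiguous in memory.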

html5ever integration

Includes HTML5 document and fragment parsing and HTML5 serialization (mark-up output). With the marked::Document (DOM), parsing and serialization are measurably faster (see benchmarks in source tree) than with the RcDom previously included with html5ever associated crates, and mutating the Document is more straightforward, via a mutable reference.

xml-rs integration

Strict, UTF-8 XML parsing to marked::Document is currently supported by integration of the xml-rs crate.

Legacy character encoding support

An estimated 5% of the web remains in encodings other than UTF-8, which is too common to be treated as an error. The following is supported via marked::html::parse_buffered:

Decoding via encoding_rs which implements The Encoding Standard including alternative names (labels) for supported encodings.
HTML5 parsing restart from initial (4k) buffer with new encoding hints obtained from <head>/<meta> charset or an http-equiv content-type with charset.
Byte-Order-Mark (BOM) sniffing as a high priority EncodingHint for UTF-8, UTF-16 Big-Endian and UTF-16 Little-Endian.
"Impossible" hints from the above are ignored. For example, a hint read while decoding as UTF-8 that claims the document is UTF-16LE is ignored, since that hint could not have been read at all if the document really were UTF-16LE.

(Note that the detection features are not currently provided by html5ever and associated crates.)
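
The BOM sniffing and "impossible" hint rules listed above can be illustrated with a small standard-library sketch. It does not use the marked or encoding_rs APIs: the Enc enum and the functions sniff_bom, hint_is_possible and choose are hypothetical names, and treating any UTF-16 hint read through a non-UTF-16 decoding as impossible is a simplifying assumption of this example.

```rust
/// A tiny stand-in for a real encoding type (hypothetical, for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Enc {
    Utf8,
    Utf16Be,
    Utf16Le,
    Windows1252,
}

/// Inspect the first bytes of a buffer for a Byte-Order-Mark.
pub fn sniff_bom(buf: &[u8]) -> Option<Enc> {
    if buf.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(Enc::Utf8)
    } else if buf.starts_with(&[0xFE, 0xFF]) {
        Some(Enc::Utf16Be)
    } else if buf.starts_with(&[0xFF, 0xFE]) {
        Some(Enc::Utf16Le)
    } else {
        None
    }
}

fn is_utf16(enc: Enc) -> bool {
    matches!(enc, Enc::Utf16Be | Enc::Utf16Le)
}

/// A <meta> hint that names UTF-16, but was read while decoding with an
/// ASCII-compatible encoding, contradicts itself and should be ignored.
pub fn hint_is_possible(read_as: Enc, hint: Enc) -> bool {
    !(is_utf16(hint) && !is_utf16(read_as))
}

/// Pick an encoding: a BOM wins, then a possible <meta> hint, then the
/// encoding the initial buffer was read with.
pub fn choose(buf: &[u8], read_as: Enc, meta_hint: Option<Enc>) -> Enc {
    sniff_bom(buf)
        .or(meta_hint.filter(|hint| hint_is_possible(read_as, *hint)))
        .unwrap_or(read_as)
}
```

In this sketch a result different from read_as is the signal to restart parsing of the initial buffer with the newly chosen encoding.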

Rust "selectors" API

A NodeRef type with "CSS selectors"-like methods to recursively select and find elements using closure predicates. We prefer direct Rust language compiler support for writing such selection logic over CSS or another interpreted DSL.
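
To illustrate the idea of selecting with closure predicates rather than a selector DSL, here is a minimal standard-library sketch. It is not the NodeRef API: the Element type and its select, find, attr, child and get_attr methods are hypothetical names chosen for this example.

```rust
/// A minimal owned element tree (hypothetical, for illustration only).
#[derive(Debug)]
pub struct Element {
    pub name: String,
    pub attrs: Vec<(String, String)>,
    pub children: Vec<Element>,
}

impl Element {
    pub fn new(name: &str) -> Element {
        Element { name: name.to_string(), attrs: Vec::new(), children: Vec::new() }
    }

    /// Builder-style helper: add an attribute.
    pub fn attr(mut self, key: &str, value: &str) -> Element {
        self.attrs.push((key.to_string(), value.to_string()));
        self
    }

    /// Builder-style helper: add a child element.
    pub fn child(mut self, child: Element) -> Element {
        self.children.push(child);
        self
    }

    pub fn get_attr(&self, key: &str) -> Option<&str> {
        self.attrs
            .iter()
            .find(|(k, _)| k.as_str() == key)
            .map(|(_, v)| v.as_str())
    }

    /// Recursively collect every descendant for which `predicate` is true,
    /// in document (depth-first, pre-order) order.
    pub fn select<'a, P>(&'a self, predicate: P) -> Vec<&'a Element>
    where
        P: Fn(&Element) -> bool,
    {
        let mut out = Vec::new();
        self.select_into(&predicate, &mut out);
        out
    }

    fn select_into<'a, P>(&'a self, predicate: &P, out: &mut Vec<&'a Element>)
    where
        P: Fn(&Element) -> bool,
    {
        for child in &self.children {
            if predicate(child) {
                out.push(child);
            }
            child.select_into(predicate, out);
        }
    }

    /// Return the first matching descendant, if any.
    pub fn find<P>(&self, predicate: P) -> Option<&Element>
    where
        P: Fn(&Element) -> bool,
    {
        self.select(predicate).into_iter().next()
    }
}
```

A call such as `doc.select(|e| e.name == "a" && e.get_attr("href").is_some())` plays the role of the CSS selector `a[href]`, but the predicate is ordinary type-checked Rust.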

HTML tag and attribute metadata

See marked::html::t (tags) and marked::html::a (attributes) modules.

Tree walking filters API

Bulk modifications to the DOM are easily and efficiently achieved with mutating filter functions/closures and a tree walker (depth or breadth-first) implementation in marked. This style of interface is sometimes called the "visitor pattern". See Document::filter_at for details. The crate also includes the following built-in filters (a partial list):

detach_banned_element : Detach known banned (via metadata) and unknown elements

retain_basic_attributes : Remove all attributes that are not part of the "basic" logical set (via metadata)

fold_empty_inline : Fold empty or meaninglessly "inline" elements

text_normalize : Normalize text nodes by merging, replacing control characters and minimizing white-space
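
The filter style described above can be sketched with the standard library alone. This is an illustration of the pattern, not the marked API: the Node and Action types, the filter_children walker and the three small filters are hypothetical and only loosely modeled on the built-in filters named in the list. The walker here is depth-first, visiting children before their parent.

```rust
#[derive(Debug, PartialEq)]
pub enum Node {
    Text(String),
    Elem { name: String, children: Vec<Node> },
}

/// What the walker should do with a node after a filter has seen it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Action {
    Continue, // keep the node
    Detach,   // remove the node and its descendants
    Fold,     // remove the node but keep its children in its place
}

/// Depth-first walk: filter the children of each node before the node itself.
pub fn filter_children<F>(children: &mut Vec<Node>, f: &mut F)
where
    F: FnMut(&mut Node) -> Action,
{
    let old = std::mem::take(children);
    for mut node in old {
        if let Node::Elem { children: kids, .. } = &mut node {
            filter_children(kids, f);
        }
        match f(&mut node) {
            Action::Continue => children.push(node),
            Action::Detach => {}
            Action::Fold => {
                // Folding a text node simply drops it in this sketch.
                if let Node::Elem { children: kids, .. } = node {
                    children.extend(kids);
                }
            }
        }
    }
}

/// Detach `script` elements (a stand-in for a metadata-driven ban list).
pub fn detach_scripts(node: &mut Node) -> Action {
    match node {
        Node::Elem { name, .. } if name.as_str() == "script" => Action::Detach,
        _ => Action::Continue,
    }
}

/// Fold `span` elements that have no children.
pub fn fold_empty_span(node: &mut Node) -> Action {
    match node {
        Node::Elem { name, children } if name.as_str() == "span" && children.is_empty() => {
            Action::Fold
        }
        _ => Action::Continue,
    }
}

/// Collapse runs of white-space in text nodes to single spaces.
pub fn collapse_whitespace(node: &mut Node) -> Action {
    if let Node::Text(text) = node {
        *text = text.split_whitespace().collect::<Vec<_>>().join(" ");
    }
    Action::Continue
}

/// Apply the three filters in one pass; the first non-Continue action wins.
pub fn clean(children: &mut Vec<Node>) {
    filter_children(children, &mut |node: &mut Node| {
        let action = detach_scripts(node);
        if action != Action::Continue {
            return action;
        }
        let action = fold_empty_span(node);
        if action != Action::Continue {
            return action;
        }
        collapse_whitespace(node)
    });
}
```

Chaining several filters inside one closure keeps the cleanup to a single traversal of the tree, rather than one traversal per filter.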

An unreleased example, compatibility test and benchmark of filtering equivalent to the ammonia crate (for hygiene and safety) is included in the source tree (./ammonia-compare).

Roadmap

Incomplete or unstarted features that may be included in this project in the future (PRs welcome):

Complete (faster, more correct, legacy encodings) strict-mode XML parsing
Lenient-mode XML parsing
Optional (opt-in) direct charset detection (initial read buffer or entire document) via something like chardet, integrated as high priority EncodingHint.
XML/HTML pretty-indenting serialization (combines well with the existing white-space normalization features)
XML (and XHTML) serialization

License

This project is dual licensed under either of the following:

The Apache License, version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
The MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the märkəd project by you, as defined by the Apache License, shall be dual licensed as above, without any additional terms or conditions.
