The fast & forgiving HTML/XML parser.
htmlparser2 is the fastest HTML parser, and takes some shortcuts to get there. If you need strict HTML spec compliance, have a look at parse5.
npm install htmlparser2
A live demo of htmlparser2 is available on AST Explorer.
| Name | Description |
|---|---|
| htmlparser2 | Fast & forgiving HTML/XML parser |
| domhandler | Handler for htmlparser2 that turns documents into a DOM |
| domutils | Utilities for working with domhandler's DOM |
| css-select | CSS selector engine, compatible with domhandler's DOM |
| cheerio | The jQuery API for domhandler's DOM |
| dom-serializer | Serializer for domhandler's DOM |
htmlparser2 itself provides a callback interface that allows consumption of documents with minimal allocations.
For a more ergonomic experience, read Getting a DOM below.
import * as htmlparser2 from "htmlparser2";
const parser = new htmlparser2.Parser({
onopentag(name, attributes) {
/*
* This fires when a new tag is opened.
*
* If you don't need an aggregated `attributes` object,
* have a look at the `onopentagname` and `onattribute` events.
*/
if (name === "script" && attributes.type === "text/javascript") {
console.log("JS! Hooray!");
}
},
ontext(text) {
/*
* Fires whenever a section of text was processed.
*
* Note that this can fire at any point within text and you might
* have to stitch together multiple pieces.
*/
console.log("-->", text);
},
onclosetag(tagname) {
/*
* Fires when a tag is closed.
*
* You can rely on this event only firing when you have received an
* equivalent opening tag before. Closing tags without corresponding
* opening tags will be ignored.
*/
if (tagname === "script") {
console.log("That's it?!");
}
},
});
parser.write(
"Xyz <script type='text/javascript'>const foo = '<<bar>>';</script>",
);
parser.end();
Output (with multiple text events combined):
--> Xyz
JS! Hooray!
--> const foo = '<<bar>>';
That's it?!
All callbacks are optional. The handler object you pass to Parser may implement any subset of these:
| Event | Description |
|---|---|
| onopentag(name, attribs, isImplied) | Opening tag. attribs is an object mapping attribute names to values. isImplied is true when the tag was opened implicitly (HTML mode only). |
| onopentagname(name) | Emitted for the tag name as soon as it is available (before attributes are parsed). |
| onattribute(name, value, quote) | Attribute. quote is " / ' / null (unquoted) / undefined (no value, e.g. disabled). |
| onclosetag(name, isImplied) | Closing tag. isImplied is true when the tag was closed implicitly (HTML mode only). |
| ontext(data) | Text content. May fire multiple times for a single text node. |
| oncomment(data) | Comment (content between <!-- and -->). |
| oncdatastart() | Opening of a CDATA section (<![CDATA[). |
| oncdataend() | End of a CDATA section (]]>). |
| onprocessinginstruction(name, data) | Processing instruction (e.g. <?xml ...?>). |
| oncommentend() | Fires after a comment has ended. |
| onparserinit(parser) | Fires when the parser is initialized or reset. |
| onreset() | Fires when parser.reset() is called. |
| onend() | Fires when parsing is complete. |
| onerror(error) | Fires on error. |
| Option | Type | Default | Description |
|---|---|---|---|
| xmlMode | boolean | false | Treat the document as XML. This affects entity decoding, self-closing tags, CDATA handling, and more. Set this to true for XML, RSS, Atom and RDF feeds. |
| decodeEntities | boolean | true | Decode HTML entities (e.g. &amp; -> &). |
| lowerCaseTags | boolean | !xmlMode | Lowercase tag names. |
| lowerCaseAttributeNames | boolean | !xmlMode | Lowercase attribute names. |
| recognizeSelfClosing | boolean | xmlMode | Recognize self-closing tags (e.g. <br/>). Always enabled in xmlMode. |
| recognizeCDATA | boolean | xmlMode | Recognize CDATA sections as text. Always enabled in xmlMode. |
While the Parser interface closely resembles Node.js streams, it's not a 100% match.
Use the WritableStream interface to process a streaming input:
import fs from "node:fs";
import { WritableStream } from "htmlparser2/WritableStream";
const parserStream = new WritableStream({
ontext(text) {
console.log("Streaming:", text);
},
});
const htmlStream = fs.createReadStream("./my-file.html");
htmlStream.pipe(parserStream).on("finish", () => console.log("done"));
The parseDocument helper parses a string and returns a DOM tree (a Document node).
import * as htmlparser2 from "htmlparser2";
const dom = htmlparser2.parseDocument(
`<ul id="fruits">
<li class="apple">Apple</li>
<li class="orange">Orange</li>
</ul>`,
);
parseDocument accepts an optional second argument with both parser and DOM handler options:
const dom = htmlparser2.parseDocument(data, {
// Parser options
xmlMode: true,
// domhandler options
withStartIndices: true, // Add `startIndex` to each node
withEndIndices: true, // Add `endIndex` to each node
});
The DomUtils module (re-exported on the main htmlparser2 export) provides helpers for finding nodes:
import * as htmlparser2 from "htmlparser2";
const dom = htmlparser2.parseDocument(`<div><p id="greeting">Hello</p></div>`);
// Find elements by ID, tag name, or class
const greeting = htmlparser2.DomUtils.getElementById("greeting", dom);
const paragraphs = htmlparser2.DomUtils.getElementsByTagName("p", dom);
// Find elements with custom test functions
const all = htmlparser2.DomUtils.findAll(
(el) => el.attribs?.class === "active",
dom,
);
// Get text content
htmlparser2.DomUtils.textContent(greeting); // "Hello"
For CSS selector queries, use css-select:
import { selectAll, selectOne } from "css-select";
const results = selectAll("ul#fruits > li", dom);
const first = selectOne("li.apple", dom);
Or, if you'd prefer a jQuery-like API, use cheerio.
Use DomUtils to modify the tree, and dom-serializer (also available as DomUtils.getOuterHTML) to serialize it back to HTML:
import * as htmlparser2 from "htmlparser2";
const dom = htmlparser2.parseDocument(
`<ul><li>Apple</li><li>Orange</li></ul>`,
);
// Remove the first <li>
const items = htmlparser2.DomUtils.getElementsByTagName("li", dom);
htmlparser2.DomUtils.removeElement(items[0]);
// Serialize back to HTML
const html = htmlparser2.DomUtils.getOuterHTML(dom);
// "<ul><li>Orange</li></ul>"
Other manipulation helpers include appendChild, prependChild, append, prepend, and replaceElement; see the domutils docs for the full API.
htmlparser2 makes it easy to parse RSS, RDF and Atom feeds, by providing a parseFeed method:
const feed = htmlparser2.parseFeed(content);
This returns an object with type, title, link, description, updated, author, and items (an array of feed entries), or null if the document isn't a recognized feed format.
The xmlMode option is enabled by default for parseFeed. If you pass custom options, make sure to include xmlMode: true.
After having some artificial benchmarks for some time, @AndreasMadsen published his htmlparser-benchmark, which benchmarks HTML parsers based on real-world websites.
At the time of writing, the latest versions of all supported parsers show the following performance characteristics on GitHub Actions (sourced from here):
htmlparser2 : 2.17215 ms/file ± 3.81587
node-html-parser : 2.35983 ms/file ± 1.54487
html5parser : 2.43468 ms/file ± 2.81501
neutron-html5parser: 2.61356 ms/file ± 1.70324
htmlparser2-dom : 3.09034 ms/file ± 4.77033
html-dom-parser : 3.56804 ms/file ± 5.15621
libxmljs : 4.07490 ms/file ± 2.99869
htmljs-parser : 6.15812 ms/file ± 7.52497
parse5 : 9.70406 ms/file ± 6.74872
htmlparser : 15.0596 ms/file ± 89.0826
html-parser : 28.6282 ms/file ± 22.6652
saxes : 45.7921 ms/file ± 128.691
html5 : 120.844 ms/file ± 153.944
To report a security vulnerability, please use the Tidelift security contact. Tidelift will coordinate the fix and disclosure.