
davexunit

At Spritely, we'd really like it if we could embed arbitrary HTML in our Markdown files that we use in our Haunt website. It's also a longstanding issue with guile-commonmark first reported in 2018: #8

The fundamental difficulty, as I understand it, is that since the CommonMark format allows embedding arbitrary HTML (even garbage), the resulting CommonMark AST does not necessarily reflect the shape of the HTML node tree. So, you cannot directly convert a CommonMark AST to SXML when block or inline HTML nodes are present. You have to serialize to HTML first and then use an HTML-to-SXML parser.
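
To illustrate the mismatch, here is a minimal sketch in plain Guile. It models the input `<div>`, blank line, `*hello*`, blank line, `</div>`, which CommonMark parses as three sibling blocks: an HTML block, a paragraph, and another HTML block. The list-based node shapes and the `node->html` helper are hypothetical and are not guile-commonmark's actual data types. Text escaping is left out for brevity.

```scheme
(use-modules (ice-9 match))

;; Hypothetical, simplified AST: the paragraph is a sibling of the two
;; raw HTML blocks, not a child of any "div" node.
(define ast
  '(document
    (html-block "<div>")
    (paragraph (emphasis (text "hello")))
    (html-block "</div>")))

;; Serialize a node to an HTML string. Raw HTML is emitted verbatim,
;; so the nesting only appears once the text is put back together.
(define (node->html node)
  (match node
    (('document children ...)
     (string-join (map node->html children) "\n"))
    (('html-block raw) raw)
    (('paragraph children ...)
     (string-append "<p>"
                    (string-concatenate (map node->html children))
                    "</p>"))
    (('emphasis children ...)
     (string-append "<em>"
                    (string-concatenate (map node->html children))
                    "</em>"))
    (('text str) str)))

(display (node->html ast))
(newline)
```

In the serialized text the `<p>` element sits inside the `<div>`, but no node in the AST records that relationship. Only an HTML parser run over the serialized output can recover it.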

This pull request does the following:

  1. Adds new html-block and inline-html node types in (commonmark node).

  2. Adds support for parsing block and inline HTML to (commonmark blocks) and (commonmark inlines).

  3. Adds support for direct conversion of CommonMark AST to HTML text with a new commonmark->html procedure in a new (commonmark html) module.

  4. For compatibility with existing behavior, HTML nodes are converted to simple text nodes in commonmark->sxml, which means they will be escaped in the output as if they weren't parsed in the first place.

I think item 4 is particularly important because it will allow guile-commonmark to continue to work as it does today, without support for embedded HTML. The new commonmark->html interface will allow users to directly serialize to HTML (which is enough for many use-cases) or use their preferred HTML parser to convert it to SXML, such as guile-lib's (htmlprag) (which is what I'd want to do with Haunt). This avoids adding dependencies to guile-commonmark and punts on the complicated subject of HTML parsing.
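
The two paths described above might look roughly like the following sketch. It assumes the fork with this pull request applied and guile-lib are both installed. The calling convention of `commonmark->html` is an assumption here (a Markdown string in, an HTML string out); the actual signature in the branch may differ, for example by writing to a port.

```scheme
(use-modules (commonmark)        ; commonmark->sxml (existing interface)
             (commonmark html)   ; commonmark->html (added by this pull request)
             (htmlprag))         ; html->shtml (from guile-lib)

(define source
  "Some *Markdown* with <span class=\"note\">inline HTML</span>.\n")

;; Existing path: per item 4, HTML nodes become plain text nodes, so the
;; <span> markup ends up escaped when this SXML is later serialized.
(define escaped-sxml (commonmark->sxml source))

;; New path. ASSUMPTION: commonmark->html accepts a string and returns
;; the serialized HTML as a string.
(define html (commonmark->html source))

;; Optional second step: recover a tree with a separate HTML parser.
(define shtml (html->shtml html))
```

Keeping the HTML parser outside guile-commonmark means callers who only need HTML text can stop after the second step.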

The test suite file I added incorporates all 64 tests of inline/block HTML included in the CommonMark specification. Additionally, I tested that my fork of guile-commonmark can successfully parse all of the existing Spritely blog posts, serialize them to HTML, and then parse them again using html->shtml in (htmlprag).

(The test suite in general is not green, though. There are tests failing on master. I have not made the situation worse, in any case.)

podiki mentioned this pull request Apr 27, 2025
Flurando commented May 6, 2025

Is there any plan, or an existing review, to merge this?
Inline HTML is really important for Markdown parsing.
Can you do this in guile-commonmark? No, you can't!
If it won't be merged, I might have to try compiling the wip-html branch from the fork,
or switch to Skribe.

@ZelphirKaltstahl

I once had a conceptually similar case and came up with an alternative to embedding something like HTML in a Markdown document: composition. You write a small metadata file that lists the files to be composed into one result. That way, each document is written in only one language and can be handled by an appropriate parser.

This case might be different, because the embedded content really is HTML, and HTML is defined as embeddable in Markdown. But I wanted to mention the idea of composing files in case it could be a solution for your use case.
