datalogics-jacksonm
Contributor

@datalogics-jacksonm datalogics-jacksonm commented May 29, 2025

#145

Implements the feature requested above (more or less). Drastically reduces the number of characters in the output, which makes it much friendlier to LLMs and their finite context windows.

With the none value, every table is written with the minimum number of characters needed to still be a valid GFM table.

Default output should remain the same. New flag is --opt-table-cell-padding="[value]", where the value can be aligned, minimal, or none (default is aligned).

The minimal value keeps only a single space before and after each cell's contents. It balances readability against character count.
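
To make the three values concrete, here is a rough sketch of what each one does to a single row. This is not the plugin's actual code: `renderRow`, the `Padding*` names, and the hard-coded `widths` are made up for illustration, and it assumes ASCII cells that are never wider than their column.

```go
package main

import (
	"fmt"
	"strings"
)

// CellPadding mirrors the three flag values described in this PR.
// The rendering logic below is an illustrative assumption only.
type CellPadding string

const (
	PaddingAligned CellPadding = "aligned"
	PaddingMinimal CellPadding = "minimal"
	PaddingNone    CellPadding = "none"
)

// renderRow is a hypothetical helper that writes one table row.
// widths holds the widest cell per column and is only used for "aligned".
// It uses byte lengths, so it is only correct for ASCII content.
func renderRow(cells []string, widths []int, padding CellPadding) string {
	var sb strings.Builder
	sb.WriteString("|")
	for i, cell := range cells {
		switch padding {
		case PaddingAligned:
			// One space on each side, plus filler up to the column width.
			filler := strings.Repeat(" ", widths[i]-len(cell))
			sb.WriteString(" " + cell + filler + " ")
		case PaddingMinimal:
			// One space on each side, no filler.
			sb.WriteString(" " + cell + " ")
		case PaddingNone:
			// No spaces at all.
			sb.WriteString(cell)
		}
		sb.WriteString("|")
	}
	return sb.String()
}

func main() {
	cells := []string{"type", "", "jbig2"}
	widths := []int{10, 4, 7}
	for _, p := range []CellPadding{PaddingAligned, PaddingMinimal, PaddingNone} {
		fmt.Printf("%-8s %s\n", p, renderRow(cells, widths, p))
	}
}
```

In this sketch an empty cell becomes `|  |` under minimal and `||` under none, which matches the sample outputs further down in this conversation.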

This option will allow the user to specify if tables should include the
visual padding or just include the minimum required amount of characters to
render properly as a table.

Addresses JohannesKaufmann#145
Adds conditions to the logic that writes the extra padding characters.
Doesn't write the extra characters if the `padColumns` option is false.
Adds flag to the CLI for the new `padColumns` option. Default value
should be true to preserve the original expected behavior of
html2markdown.
With bool, the way the logic is constructed means there's not an easy
way to have the option default to true. Would need to check if the flag
was set in the CLI, then if not (and if plugin-table was enabled), set
the value to true.

It's just way easier to do a string and default to "on", so that's how
we're doing things!
I'm not too familiar with Go, so forgive me if I made a bunch of changes
I wasn't supposed to here.

When compiling and running it from the command line, the option would
correctly default to the value of "on" in my manual testing. However, I
couldn't for the life of me get it to behave like that in the tests
without specifically going into each one's options and explicitly adding
`WithPadColumns(PadColumnBehaviorOn)`.

I assumed it would default without that, but evidently not, and this
was the only way that the tests would work as expected.
@datalogics-jacksonm datalogics-jacksonm force-pushed the minimum-required-characters-for-tables branch from 2840933 to 1526af8 Compare May 29, 2025 21:45
There would always be 2 spaces between `|` characters if there was no
text in the cell. Now it checks if the padding option is turned on, and
if it isn't then the spaces aren't necessary for this to render properly
@datalogics-jacksonm datalogics-jacksonm force-pushed the minimum-required-characters-for-tables branch from 1526af8 to e2ea5d9 Compare May 29, 2025 21:57
Now when the padding option is off, it won't include any spaces in
between `|` characters, since they aren't necessary for the output to
render properly.
Now we have "on", "off", and "some".

"some" will add a space at the beginning and end of each cell to balance
readability against token count.
@JohannesKaufmann
Owner

@datalogics-jacksonm Thanks for creating a PR!

Looks good so far 👍

However I am not sure about the on/off/some naming. That is not that descriptive.

Can you brainstorm some other namings? For example, a cell padding of aligned/minimal/none.

WithCellPaddingBehavior(padding CellPaddingBehavior)
OR
WithCellPadding(padding CellPaddingBehavior)


type CellPaddingBehavior string

const (
    CellPaddingAligned CellPaddingBehavior = "aligned"
    CellPaddingNormal  CellPaddingBehavior = "minimal" // or "normal"
    CellPaddingCompact CellPaddingBehavior = "none"    // or "dense" or "compact"
)

@datalogics-jacksonm
Contributor Author

Totally agree, the naming convention was what I was most unsure about - I'll make that adjustment, thanks for the suggestion!

JohannesKaufmann#161 (comment)

Adjusted naming convention from the old "PadColumns" with options "on",
"off", and "some" to the new "CellPadding" with options "aligned",
"minimal", and "none".
@datalogics-jacksonm
Contributor Author

datalogics-jacksonm commented Jun 3, 2025

Also figured I'd show some real-world performance here!

Screenshot 2025-06-03 at 10 03 09 AM

Take this table as an example. With the default behavior of html2markdown (and with --opt-table-newline-behavior="preserve"), the output we get looks like this:

|            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |                                                                  |                                                                               |
|------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|-------------------------------------------------------------------------------|
| downsample | Ability to specify a target resolution and a trigger resolution at which monochrome images will be recompressed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                                                  |                                                                               |
|            | trigger-dpi                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | All monochrome images above this resolution will be downsampled. |                                                                               |
|            | target-dpi                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | The new resolution of resampled monochrome images.               |                                                                               |
| recompress | Sets the type and quality of compression used to downsample monochrome images.<br /><br /><br /><br />JBIG2 is a compression algorithm designed for binary images, or images where each pixel can only have one of two possible colors. It can be used for either lossy or lossless image processing.<br /><br /><br /><br />CCITT Group 4 refers to the compression type from the International Telegraph and Telephone Consultative Committee (CCITT) or TIU. Many fax and document imaging file formats support this form of lossless data compression encoding. These protocols are referred to as CCITT Group 3 and Group 4 compression, respectively.<br /><br /><br /><br />Lossy and lossless refer to the approach used for compressing data. For lossless, all of the data in the image is preserved. The quality of the image does not change, and it can be uncompressed to its original state. Lossy compression permanently removes data from the image file, such as pixels, reducing the image resolution. Files reduced using lossy compression will be considerably smaller, but will not print or display as well as those compressed using lossless compression. |                                                                  |                                                                               |
|            | type                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | same                                                             | Keep original default compression algorithm provided in the images themselves |
|            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | jbig2                                                            | Use jbig2 compression                                                         |
|            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | ccittg3                                                          | Use ccittg3 compression                                                       |
|            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | ccittg4                                                          | Use ccittg4 compression                                                       |
|            | quality                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | lossy                                                            | Valid for jbig2 only                                                          |
|            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | lossless                                                         | Valid for jbig2 only                                                          |
|            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |                                                                  |                                                                               |

And now with the --opt-table-cell-padding="none" flag, the output turns into this:

|||||
|---|---|---|---|
|downsample|Ability to specify a target resolution and a trigger resolution at which monochrome images will be recompressed.|||
||trigger-dpi|All monochrome images above this resolution will be downsampled.||
||target-dpi|The new resolution of resampled monochrome images.||
|recompress|Sets the type and quality of compression used to downsample monochrome images.<br /><br /><br /><br />JBIG2 is a compression algorithm designed for binary images, or images where each pixel can only have one of two possible colors. It can be used for either lossy or lossless image processing.<br /><br /><br /><br />CCITT Group 4 refers to the compression type from the International Telegraph and Telephone Consultative Committee (CCITT) or TIU. Many fax and document imaging file formats support this form of lossless data compression encoding. These protocols are referred to as CCITT Group 3 and Group 4 compression, respectively.<br /><br /><br /><br />Lossy and lossless refer to the approach used for compressing data. For lossless, all of the data in the image is preserved. The quality of the image does not change, and it can be uncompressed to its original state. Lossy compression permanently removes data from the image file, such as pixels, reducing the image resolution. Files reduced using lossy compression will be considerably smaller, but will not print or display as well as those compressed using lossless compression.|||
||type|same|Keep original default compression algorithm provided in the images themselves|
|||jbig2|Use jbig2 compression|
|||ccittg3|Use ccittg3 compression|
|||ccittg4|Use ccittg4 compression|
||quality|lossy|Valid for jbig2 only|
|||lossless|Valid for jbig2 only|
|||||

Not very readable for humans, but it still renders as a valid Markdown table. Dropping both outputs into https://gpt-tokenizer.dev/ gives:

  • Table with cell padding aligned: 16,964 characters | 577 tokens
  • Table with cell padding none: 1,727 characters | 395 tokens

So we get a reduction in tokens of ~32%. That is, of course, in stark contrast to the whopping ~90% reduction in characters! This is due to the optimizations of the o200k_base encoding that many LLMs use: long runs of `-` characters and of spaces are encoded as single tokens, despite the large number of characters. Even so, I still think that a ~30% reduction in tokens (even if just for one aspect of an HTML document) can be a worthwhile optimization.


And a fun thing I didn't expect: take the output of --opt-table-cell-padding="minimal":

|  |  |  |  |
|---|---|---|---|
| downsample | Ability to specify a target resolution and a trigger resolution at which monochrome images will be recompressed. |  |  |
|  | trigger-dpi | All monochrome images above this resolution will be downsampled. |  |
|  | target-dpi | The new resolution of resampled monochrome images. |  |
| recompress | Sets the type and quality of compression used to downsample monochrome images.<br /><br /><br /><br />JBIG2 is a compression algorithm designed for binary images, or images where each pixel can only have one of two possible colors. It can be used for either lossy or lossless image processing.<br /><br /><br /><br />CCITT Group 4 refers to the compression type from the International Telegraph and Telephone Consultative Committee (CCITT) or TIU. Many fax and document imaging file formats support this form of lossless data compression encoding. These protocols are referred to as CCITT Group 3 and Group 4 compression, respectively.<br /><br /><br /><br />Lossy and lossless refer to the approach used for compressing data. For lossless, all of the data in the image is preserved. The quality of the image does not change, and it can be uncompressed to its original state. Lossy compression permanently removes data from the image file, such as pixels, reducing the image resolution. Files reduced using lossy compression will be considerably smaller, but will not print or display as well as those compressed using lossless compression. |  |  |
|  | type | same | Keep original default compression algorithm provided in the images themselves |
|  |  | jbig2 | Use jbig2 compression |
|  |  | ccittg3 | Use ccittg3 compression |
|  |  | ccittg4 | Use ccittg4 compression |
|  | quality | lossy | Valid for jbig2 only |
|  |  | lossless | Valid for jbig2 only |
|  |  |  |  |

Throw that into https://gpt-tokenizer.dev/ and we get 1,823 characters and 443 tokens.

That means with just these few extra spaces, we go from a ~32% reduction down to a ~23% reduction in tokens compared to aligned! Relative to none, that's only a ~5.5% increase in characters but a ~12% increase in tokens. I didn't expect that to be so drastic, and on one table I tried, the minimal setting even produced MORE tokens than the default aligned setting! Pretty interesting outcome.
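
If anyone wants to double-check the percentages, here is a small sketch that recomputes them from the character and token counts quoted in this comment. The `sample` type and `pctChange` helper are just names I picked for the example; the counts themselves come from the gpt-tokenizer.dev results above.

```go
package main

import "fmt"

// sample holds the character and token counts reported in this comment.
type sample struct {
	mode          string
	chars, tokens int
}

// pctChange returns the signed percentage change from base to value.
// A negative result means value is smaller than base.
func pctChange(base, value int) float64 {
	return 100 * float64(value-base) / float64(base)
}

func main() {
	aligned := sample{"aligned", 16964, 577}
	others := []sample{
		{"none", 1727, 395},
		{"minimal", 1823, 443},
	}

	// Each mode compared against the default "aligned" output.
	for _, s := range others {
		fmt.Printf("%-8s chars %+.1f%%  tokens %+.1f%%\n",
			s.mode,
			pctChange(aligned.chars, s.chars),
			pctChange(aligned.tokens, s.tokens))
	}

	// "minimal" compared against "none".
	fmt.Printf("minimal vs none: chars %+.1f%%  tokens %+.1f%%\n",
		pctChange(1727, 1823), pctChange(395, 443))
}
```

The token reductions against aligned work out to roughly 31.5% for none and 23.2% for minimal.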


I think that, in general, none will always beat aligned on token count, while minimal will vary on a case-by-case basis.

Owner

@JohannesKaufmann JohannesKaufmann left a comment


Looks good, mostly setting the default and consistent naming

switch behavior {
case "":
// Allow empty string to default to "aligned"
p.cellPaddingBehavior = CellPaddingBehavior(CellPaddingAligned)
Owner

The wrapping with CellPaddingBehavior() is not needed here since the constant already has the correct type.

commonmark.NewCommonmarkPlugin(),
NewTablePlugin(),
NewTablePlugin(
WithCellPadding(CellPaddingAligned),
Owner

Please remove that since we want to test the default here.

Instead we can set a default:

func NewTablePlugin(opts ...option) converter.Plugin {
	plugin := &tablePlugin{
		cellPaddingBehavior: CellPaddingAligned,
	}
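
For context, a self-contained sketch of why setting the default in the constructor helps. The types here are simplified stand-ins, not the real plugin: `tablePlugin` has a single field, `option` does not return an error, and `newTablePlugin` returns the struct directly instead of `converter.Plugin`. Without the default, the field starts as the zero value `""`, which would explain why the tests only passed when the option was set explicitly.

```go
package main

import "fmt"

type CellPaddingBehavior string

const (
	CellPaddingAligned CellPaddingBehavior = "aligned"
	CellPaddingMinimal CellPaddingBehavior = "minimal"
	CellPaddingNone    CellPaddingBehavior = "none"
)

// tablePlugin is a simplified stand-in for the real plugin struct.
type tablePlugin struct {
	cellPaddingBehavior CellPaddingBehavior
}

// option is a simplified functional option (assumed shape, no error return).
type option func(p *tablePlugin)

// WithCellPadding overrides the default padding behavior.
func WithCellPadding(b CellPaddingBehavior) option {
	return func(p *tablePlugin) { p.cellPaddingBehavior = b }
}

// newTablePlugin sets the default first, then applies any options,
// so callers that pass no options still get "aligned".
func newTablePlugin(opts ...option) *tablePlugin {
	p := &tablePlugin{cellPaddingBehavior: CellPaddingAligned}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	fmt.Println(newTablePlugin().cellPaddingBehavior)
	fmt.Println(newTablePlugin(WithCellPadding(CellPaddingNone)).cellPaddingBehavior)
}
```

With this shape, tests that exercise the default can call the constructor with no options at all.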

commonmark.NewCommonmarkPlugin(),
NewTablePlugin(
WithSpanCellBehavior("random"),
WithCellPadding(CellPaddingAligned),
Owner

With the default set, adding it to every test case is not necessary anymore.

}

func TestOptionFunc_PadColumns(t *testing.T) {
testCases := []struct {
Owner

A small thing: can we always follow the order of 1) aligned, 2) minimal, and 3) none, to keep things consistent?

}
}

func TestOptionFunc_PadColumns(t *testing.T) {
Owner

Rename to TestOptionFunc_CellPadding

CellPaddingNone CellPaddingBehavior = "none"
)

// WithPadColumns configures how to handle padding in table cells.
Owner

Rename to WithCellPadding

return nil

default:
return fmt.Errorf("unknown value %q for pad columns behavior", behavior)
Owner

update naming
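
Pulling the naming comments together, one possible shape for the validation is sketched below. This is an assumption about how it could look, not the code in this PR: `parseCellPadding` is a hypothetical standalone helper, whereas the real logic sits inside the option function. It keeps the aligned, minimal, none order, treats the empty string as aligned, and uses "cell padding" in the error message.

```go
package main

import "fmt"

type CellPaddingBehavior string

const (
	CellPaddingAligned CellPaddingBehavior = "aligned"
	CellPaddingMinimal CellPaddingBehavior = "minimal"
	CellPaddingNone    CellPaddingBehavior = "none"
)

// parseCellPadding is a hypothetical helper that validates a raw flag value.
func parseCellPadding(value string) (CellPaddingBehavior, error) {
	switch CellPaddingBehavior(value) {
	case "":
		// Allow the empty string to default to "aligned".
		// No extra conversion is needed: the constant already has the right type.
		return CellPaddingAligned, nil
	case CellPaddingAligned, CellPaddingMinimal, CellPaddingNone:
		return CellPaddingBehavior(value), nil
	default:
		return "", fmt.Errorf("unknown value %q for cell padding behavior", value)
	}
}

func main() {
	for _, v := range []string{"", "aligned", "minimal", "none", "some"} {
		b, err := parseCellPadding(v)
		fmt.Printf("%q -> %q, err=%v\n", v, b, err)
	}
}
```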
