-
Notifications
You must be signed in to change notification settings - Fork 162
plugin(table): Add option to remove padding from tables #161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
plugin(table): Add option to remove padding from tables #161
Conversation
This option will allow the user to specify if tables should include the visual padding or just include the minimum required amount of characters to render properly as a table. Addresses JohannesKaufmann#145
Adds conditions to the logic that writes the extra padding characters. Doesn't write the extra characters if the `padColumns` option is false.
Adds flag to the CLI for the new `padColumns` option. Default value should be true to preserve the original expected behavior of html2mardown.
With bool, the way the logic is constructed means there's not an easy way to have the option default to true. Would need to check if the flag was set in the CLI, then if not (and if plugin-table was enabled), set the value to true. It's just way easier to do a string and default to "on", so that's how we're doing things!
I'm not too familiar with Go, so forgive me if I made a bunch of changes I wasn't supposed to here. When compiling and running it from the command line, the option would correctly default to the value of "on" in my manual testing. However, I couldn't for the life of me get it to behave like that in the tests without specifically going into each one's options and explicltly adding `WithPadColumns(PadColumnBehaviorOn)`. I assumed it would default without that, but evidently not, and this was the only way that the tests would work as expected.
2840933
to
1526af8
Compare
There would always be 2 spaces between `|` characters if there was no text in the cell. Now it checks if the padding option is turned on, and if it isn't then the spaces aren't necessary for this to render properly
1526af8
to
e2ea5d9
Compare
Now when the padding option is off, it won't include any spaces in between `|` characters, since they aren't necessary for the output to render properly.
Now we have "on", "off", and "some". "some" will add a space at the beginning and end of each cell to balance between readability while still trying to minimize token count.
@datalogics-jacksonm Thanks for creating a PR! Looks good so far 👍 However I am not sure about the on/off/some naming. That is not that descriptive. Can you brainstorm some other namings? For example, a cell padding of aligned/minimal/none.
|
Totally agree, the naming convention was what I was most unsure about - I'll make that adjustment, thanks for the suggestion! |
JohannesKaufmann#161 (comment) Adjusted naming convention from the old "PadColumns" with options "on", "off", and "some" to the new "CellPadding" with options "aligned", "minimal", and "none".
Also figured I'd show some real-world performance here! Take this table as an example. With the default behavior of
And now with the
Not really very readable for humans, but it's fully able to be rendered as markdown. Dropping both outputs into https://gpt-tokenizer.dev/, you'll get:
So we get a reduction in tokens of ~32%. That is, of course in stark opposition to the whopping ~90% reduction in characters! This is just due to the optimizations of the o200k_base encoding that a majority of LLMs use. Long strings of And a fun thing I didn't expect, if we take the output of
And throw that into https://gpt-tokenizer.dev/, we get the numbers 1,823 characters, and 443 tokens. That means with just these few extra spaces, we go from a ~32% reduction down to a ~24% reduction in tokens! That's only a 5.5% increase in characters corresponding to a 12% increase in tokens. I didn't expect that to be so drastic, and on one table I tried, on the I think in general, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, mostly setting the default and consistent naming
switch behavior { | ||
case "": | ||
// Allow empty string to default to "aligned" | ||
p.cellPaddingBehavior = CellPaddingBehavior(CellPaddingAligned) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wrapping with CellPaddingBehavior()
is not needed here since it has already the correct type.
commonmark.NewCommonmarkPlugin(), | ||
NewTablePlugin(), | ||
NewTablePlugin( | ||
WithCellPadding(CellPaddingAligned), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove that since we want to test the default here.
Instead we can set a default:
func NewTablePlugin(opts ...option) converter.Plugin {
plugin := &tablePlugin{
cellPaddingBehavior: CellPaddingAligned,
}
commonmark.NewCommonmarkPlugin(), | ||
NewTablePlugin( | ||
WithSpanCellBehavior("random"), | ||
WithCellPadding(CellPaddingAligned), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the default set, adding it to every testcase is not nessesary anymore.
} | ||
|
||
func TestOptionFunc_PadColumns(t *testing.T) { | ||
testCases := []struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small thing: Can we always follow the order of 1) aligned 2) minimal and 3) none to keep the consistency.
} | ||
} | ||
|
||
func TestOptionFunc_PadColumns(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to TestOptionFunc_CellPadding
CellPaddingNone CellPaddingBehavior = "none" | ||
) | ||
|
||
// WithPadColumns configures how to handle padding in table cells. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to WithCellPadding
return nil | ||
|
||
default: | ||
return fmt.Errorf("unknown value %q for pad columns behavior", behavior) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update naming
#145
Implements feature as requested above (more or less). Drastically reduces the amount of characters in the output, making the output much more friendly to LLMs due to their finite context windows.
Essentially creates every table with the minimum amount of characters necessary to still be valid GFM Tables.
Default output should remain the same. New flag is
--opt-table-cell-padding="[value]"
, where the value can bealigned
,minimal
, ornone
(default isaligned
).The
minimal
value just preserves the spaces before and after each cell's contents. Balances readability with reducing the character count.