feat: create `harper-thesaurus` #2085

86xsk · 2025-10-16T01:24:22Z

Description

Creates harper-thesaurus, laying the initial groundwork for providing synonyms in Harper.

This uses Moby Thesaurus II, a public domain thesaurus.

TODO

- Develop a way for callers to get a list of synonyms sorted by frequency of use, which should make it easier to provide relevant synonyms.
- Integrate with the BoringWords linter, providing suggestions for alternative words.
- (Try to) apply appropriate inflection to suggested synonyms.
- Explore alternative methods for integrating thesaurus data.
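
A minimal sketch of the frequency-sorting idea from the first item, assuming a word-to-count table is available. The function name `sort_by_frequency` and the `freq` table are hypothetical and only illustrate the approach; they are not this PR's API:

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Return `synonyms` ordered from most to least frequently used.
/// `freq` is a hypothetical word -> occurrence-count table;
/// words missing from it are treated as having a count of 0.
fn sort_by_frequency<'a>(synonyms: &[&'a str], freq: &HashMap<String, u64>) -> Vec<&'a str> {
    let mut sorted = synonyms.to_vec();
    // `sort_by_key` is stable, so words with equal counts keep
    // the order they had in the thesaurus entry.
    sorted.sort_by_key(|w| Reverse(freq.get(*w).copied().unwrap_or(0)));
    sorted
}
```

A stable sort seems like a reasonable default here, since the thesaurus order is the only other signal available when two counts tie.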

How Has This Been Tested?

`cargo test`

Checklist

I have performed a self-review of my own code
I have added tests to cover my changes

hippietrail · 2025-10-16T05:38:29Z

Since Harper is a linter at heart it needs to begin with flagging a run of text. Do you envisage a thesaurus for changing any word to a synonym or for flagging certain specific words and then suggesting synonyms? So far the latter seems to fit Harper but not the former.

I've been toying with this myself for certain kinds of buzzwords and overused words that have become fashionable:

- "iconic" - mostly now used to mean "special" but sometimes has its real meaning.
- "diminish" - suddenly the hot new word for "reduce" and other words with similar meanings. Also still used in its original meaning.
- "uplift" - a verb almost the antonym of "diminish". I don't think it has a common original meaning. The adjective "uplifting" has always been common though.
- "utilize" / "utilise" - now almost always used as a fancy way of saying "use" though still sometimes seen in its correct sense in British writing.

Some require heuristics to decide if they're OK or should be flagged, or to show a different message depending on how strong the clues are that the word is rightly/wrongly used. Sometimes a linter might be able to reword the context more than a regular thesaurus can.

The swear word replacement linter is kind of in this category too.

86xsk · 2025-10-16T12:58:01Z

> Since Harper is a linter at heart it needs to begin with flagging a run of text. Do you envisage a thesaurus for changing any word to a synonym or for flagging certain specific words and then suggesting synonyms? So far the latter seems to fit Harper but not the former.

That's a good point. At the moment my goal with this is to create a basic backbone for a thesaurus and to try experimenting by integrating it with a linter. I'm hoping that will provide a better idea as to how practical/useful this feature would be.

Currently, I'm only envisioning it as providing a basic common thesaurus API that linters could call into to provide synonyms when they deem it beneficial. (In the future, if there's interest in it, I could see harper-ls being able to provide synonyms for words on request too, much like other language servers can provide refactoring actions on demand. But as you mention, that might be out of scope for the project, and it's certainly outside the scope of this PR.)

One concern I currently have is the large size of the thesaurus. The source text file is ~24MB, and I'm not sure how workable and/or justifiable that is, especially if the feature isn't used much. I guess it might be possible to make it an optional feature, or perhaps to compress the file in some way too. I'd be curious to see how it affects the binary size and benchmarks once I can integrate it with a linter.

hippietrail · 2025-10-16T14:17:45Z

> One concern I currently have is the large size of the thesaurus. The source text file is ~24MB, and I'm not sure how workable and/or justifiable that is, especially if the feature isn't used much. I guess it might be possible to make it an optional feature, or perhaps to compress the file in some way too. I'd be curious to see how it affects the binary size and benchmarks once I can integrate it with a linter.

That's pretty big. Last time I looked at the size of the Harper curated dictionary it was less than 60,000 lines... in fact it's 754,147 bytes on disk right now. At the moment the dictionary has some amount of affix compression while I'll assume the thesaurus has no compression. But since they are text files presumably with words referencing other words, it shouldn't be too hard to come up with a format where most words appear as full strings only once and the rest are references as string slices. Also every word could just have a numeric ID and thence reference each other by ID, which may or may not be smaller than a string slice.
Also the thesaurus need not live in memory like the dictionary and could live just on disk. I'm sure it would be possible to implement it once without worrying about efficiency and then improve it with new schemes over time.
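
A rough sketch of the numeric-ID idea, assuming entries are built from a headword plus a list of synonyms. `InternedThesaurus` and all of its methods are hypothetical names for illustration, not anything that exists in Harper:

```rust
use std::collections::HashMap;

/// Hypothetical compact thesaurus: synonym lists hold numeric IDs
/// instead of repeating each word's text in every entry.
#[derive(Default)]
struct InternedThesaurus {
    /// ID -> word text.
    words: Vec<String>,
    /// Word text -> ID. This is an in-memory lookup index that duplicates
    /// the text; a serialized form would only need `words` and `synonyms`.
    ids: HashMap<String, u32>,
    /// Headword ID -> synonym IDs.
    synonyms: HashMap<u32, Vec<u32>>,
}

impl InternedThesaurus {
    /// Return the ID for `word`, assigning a new one the first time it is seen.
    fn intern(&mut self, word: &str) -> u32 {
        if let Some(&id) = self.ids.get(word) {
            return id;
        }
        let id = self.words.len() as u32;
        self.words.push(word.to_owned());
        self.ids.insert(word.to_owned(), id);
        id
    }

    /// Record an entry, interning the headword and every synonym.
    fn add_entry(&mut self, headword: &str, syns: &[&str]) {
        let head = self.intern(headword);
        let syn_ids: Vec<u32> = syns.iter().map(|s| self.intern(s)).collect();
        self.synonyms.insert(head, syn_ids);
    }

    /// Look up the synonyms of `word`; empty if the word has no entry.
    fn synonyms_of(&self, word: &str) -> Vec<&str> {
        let Some(id) = self.ids.get(word) else {
            return Vec::new();
        };
        self.synonyms
            .get(id)
            .map(|v| v.iter().map(|&i| self.words[i as usize].as_str()).collect())
            .unwrap_or_default()
    }
}
```

Whether this ends up smaller than plain strings depends on the ID width and the average word length, so it would need measuring against the real file.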

elijah-potter · 2025-10-22T16:45:18Z

I adore the idea of doing stuff with a thesaurus, but (like @hippietrail), I'm concerned about the filesize.

My initial reaction is that we could have synonym lookups be a separate feature from the linter and happen on-demand over the network?

We could also provide multiple versions of the Harper binary. One would include only the linter with a limited dataset (call it harper-ls-small) and the other would include the thesaurus (call it harper-ls-large). This would also give us space to add features like autocomplete.

I eagerly await your thoughts.

86xsk · 2025-10-22T17:57:51Z

> My initial reaction is that we could have synonym lookups be a separate feature from the linter and happen on-demand over the network?
>
> We could also provide multiple versions of the Harper binary. One would include only the linter with a limited dataset (call it harper-ls-small) and the other would include the thesaurus (call it harper-ls-large). This would also give us space to add features like autocomplete.

Those are both very interesting ideas that I haven't considered. I'll certainly have to keep them in mind.

I've done some more development on this locally, and I've found that it increases the size of harper-ls by about the same size as the thesaurus file itself (~24MB). I'm curious to explore options regarding reducing the file size, perhaps by filtering out entries we're unlikely to need, or by compressing the file in some way.

In terms of shipping these changes, I guess my current development is more aligned with the second idea you mentioned, with the thesaurus being an optional feature that can be enabled for a build. (Though admittedly I haven't thought too much about that yet, as I've been more focused on just getting some decent output 😆)

Unfortunately I haven't been able to dedicate too much time to this as of late, hence progress has been somewhat slow. If there's interest, I could clean up and push the changes I have locally at this point. Though currently there's still the aforementioned quality of output issues, along with what seem to be some noticeable performance regressions (pending further testing).

elijah-potter · 2025-10-24T14:06:48Z

I'd love to see what you have been cooking up. As this is a draft PR, I see no problem with your pushing up your work, even if it's unfinished.

elijah-potter · 2025-10-27T21:24:59Z

Random thought: Instead of storing the full words to create the network, you could store the WordId. Since it is much smaller, it could compress the full thesaurus down by quite a bit. It would obviously have to be a compile step, however. I understand that would make things much more complex.

86xsk · 2025-10-27T23:00:09Z

> Random thought: Instead of storing the full words to create the network, you could store the WordId. Since it is much smaller, it could compress the full thesaurus down by quite a bit. It would obviously have to be a compile step, however. I understand that would make things much more complex.

If my (slightly sleep-deprived) shell scripting is correct, it unfortunately doesn't look like it would provide too much space savings:

Currently I'm thinking of:

- Compressing/serializing the file in some way. Merely compressing the file with zip brings it down to about 9MB, so I think there's some decent potential for savings there. Though whether it will be enough is another question.
- Removing entries that we might not need. For instance, by removing entries for words that don't exist in the curated dictionary. Now that I think about it, this should technically be possible to do at build-time too, so we can avoid making destructive changes to the thesaurus file.
- Rather than making the thesaurus an optional feature that's built into the binary, we could simply load a thesaurus file from disk at runtime, much like the user dictionary. However, I'm not sure how workable this would be for the non-native/web builds, since I'm not too experienced on that side of things.
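
To make the second option a bit more concrete, here is a rough sketch of filtering entries against a dictionary. It assumes each thesaurus line is a comma-separated list whose first field is the headword, and that the curated dictionary can be exposed as a plain set of words. `filter_line` and `filter_thesaurus` are hypothetical names, not existing Harper functions:

```rust
use std::collections::HashSet;

/// Parse one thesaurus line, assumed to look like `headword,syn1,syn2,...`,
/// keeping only words found in `dictionary`.
/// Returns `None` if the headword is unknown or no synonyms survive.
fn filter_line<'a>(
    line: &'a str,
    dictionary: &HashSet<String>,
) -> Option<(&'a str, Vec<&'a str>)> {
    let mut fields = line.split(',').map(str::trim);
    let headword = fields.next().filter(|w| dictionary.contains(*w))?;
    let synonyms: Vec<&str> = fields.filter(|w| dictionary.contains(*w)).collect();
    if synonyms.is_empty() {
        None
    } else {
        Some((headword, synonyms))
    }
}

/// Filter a whole thesaurus source text, leaving the source itself untouched.
fn filter_thesaurus<'a>(
    source: &'a str,
    dictionary: &HashSet<String>,
) -> Vec<(&'a str, Vec<&'a str>)> {
    source
        .lines()
        .filter_map(|line| filter_line(line, dictionary))
        .collect()
}
```

Something like this could in principle run from a build script, so the original thesaurus file stays as-is in the repository.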

(I still haven't done too much research into this to be honest, though it is my next step. I've been experimenting with trying to inflect the synonyms based on the source word, but I'm starting to feel like that might be better left for a future PR, in the interest of keeping this one reviewable.)

elijah-potter · 2025-10-28T17:34:04Z

> Removing entries that we might not need. For instance, by removing entries for words that don't exist in the curated dictionary. Now that I think about it, I think this should technically be possible to do at build-time too, so we can avoid making destructive changes to the thesaurus file.

I think this is a great first step. We can also remove any entries we don't consider "common" according to the metadata.

feat: create harper-thesaurus

e76602a

86xsk added 4 commits October 17, 2025 11:10

chore(core): add harper-thesaurus as opt. dep.

6671ac4

Add `harper-thesaurus` as optional dependency.

chore(ls): re-export harper-core/thesaurus

b362467

refactor(core): make Suggestion more flexible

aa23e99

feat(core): add module thesaurus_helper

5a87648

Add a module that makes it easier to use the thesaurus from `harper-core`.

86xsk added 6 commits October 24, 2025 14:50

feat(core): provide synonyms in BoringWords lint

d75cdfb

feat(thesaurus): sort synonyms by word frequency

9a0522f

perf(thesaurus): use HashMap to avoid O(n)

c31191e

feat(core): add DictWordMetadata::difference

35ee518

feat(thesaurus): sort by TokenKind similarity

ae8322b

Sort by `TokenKind` similarity in addition to sorting by word frequency.

test(thesaurus): remove unnecessary test

0274214

hippietrail mentioned this pull request Oct 27, 2025

[Obsidian] Synonym lookup under right-click menu #1478

Open
