@86xsk
Contributor

@86xsk 86xsk commented Oct 16, 2025

Description

Creates harper-thesaurus, laying the initial groundwork for providing synonyms in Harper.

This uses Moby Thesaurus II, a public domain thesaurus.

TODO

  • Develop a way for callers to get a list of synonyms sorted by frequency of use, which should make it easier to find and provide relevant synonyms.
  • Integrate with the BoringWords linter, providing suggestions for alternative words.
  • (Try to) apply appropriate inflection to suggested synonyms.
  • Explore alternative methods for integrating thesaurus data.
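
A minimal sketch of the frequency-sorting item in the list above, assuming a word-to-usage-count table is available from somewhere. The `sort_by_frequency` function and the `freq` table are hypothetical names for illustration and are not part of this PR:

```rust
use std::collections::HashMap;

/// Return `synonyms` ordered from most to least frequently used.
///
/// `freq` is an assumed word -> usage-count table; words missing from it
/// are treated as having a count of 0. Ties are broken alphabetically so
/// the result is deterministic.
pub fn sort_by_frequency<'a>(
    synonyms: &[&'a str],
    freq: &HashMap<String, u64>,
) -> Vec<&'a str> {
    let mut sorted = synonyms.to_vec();
    sorted.sort_by(|a, b| {
        let fa = freq.get(*a).copied().unwrap_or(0);
        let fb = freq.get(*b).copied().unwrap_or(0);
        // Higher counts first, then alphabetical order.
        fb.cmp(&fa).then_with(|| a.cmp(b))
    });
    sorted
}
```

A caller would pass the raw synonym list for a word plus the frequency table, and show the first few results as suggestions.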

How Has This Been Tested?

  • cargo test

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

@hippietrail
Collaborator

Since Harper is a linter at heart it needs to begin with flagging a run of text. Do you envisage a thesaurus for changing any word to a synonym or for flagging certain specific words and then suggesting synonyms? So far the latter seems to fit Harper but not the former.

I've been toying with this myself for certain kinds of buzzwords and overused words that have become fashionable:

  • "iconic" - mostly now used to mean "special" but sometimes has its real meaning.
  • "diminish" - suddenly the hot new word for "reduce" and other words with similar meanings. Also still used in its original meaning.
  • "uplift" - a verb almost the antonym of "diminish". I don't think it has a common original meaning. The adjective "uplifting" has always been common though.
  • "utilize" / "utilise" - now almost always used as a fancy way of saying "use" though still sometimes seen in its correct sense in British writing.

Some require heuristics to decide if they're OK or should be flagged, or to show a different message depending on how strong the clues are that the word is rightly/wrongly used. Sometimes a linter might be able to reword the context more than a regular thesaurus could.

The swear word replacement linter is kind of in this category too.

@86xsk
Contributor Author

86xsk commented Oct 16, 2025

Since Harper is a linter at heart it needs to begin with flagging a run of text. Do you envisage a thesaurus for changing any word to a synonym or for flagging certain specific words and then suggesting synonyms? So far the latter seems to fit Harper but not the former.

That's a good point. At the moment my goal with this is to create a basic backbone for a thesaurus and to try experimenting by integrating it with a linter. I'm hoping that will provide a better idea as to how practical/useful this feature would be.

Currently, I'm only envisioning it as providing a basic common thesaurus API that linters could call into to provide synonyms when they deem beneficial. (In the future, if there's interest in it, I could see harper-ls being able to provide synonyms for words on request too, much like other language servers can provide refactoring actions on demand. But as you mention, that might be out of scope for the project, and it's certainly outside the scope of this PR.)
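
As a rough illustration of the "common thesaurus API that linters could call into" idea, here is one possible shape. Every name here (`Thesaurus`, `InMemoryThesaurus`, `suggest_for_flagged`) is an assumption made for the sketch, not the API in this PR. It follows the "flag specific words, then suggest synonyms" approach discussed above:

```rust
use std::collections::HashMap;

/// Hypothetical lookup interface that a linter could depend on.
pub trait Thesaurus {
    /// Synonyms for `word`, or an empty list if the word is unknown.
    fn synonyms(&self, word: &str) -> Vec<String>;
}

/// Simplest possible backing store: everything held in a map in memory.
pub struct InMemoryThesaurus {
    pub entries: HashMap<String, Vec<String>>,
}

impl Thesaurus for InMemoryThesaurus {
    fn synonyms(&self, word: &str) -> Vec<String> {
        self.entries.get(word).cloned().unwrap_or_default()
    }
}

/// Find every occurrence of a flagged word in `text` (case-insensitively)
/// and pair it with whatever synonyms the thesaurus offers.
pub fn suggest_for_flagged(
    text: &str,
    flagged: &[&str],
    thesaurus: &dyn Thesaurus,
) -> Vec<(String, Vec<String>)> {
    text.split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .filter(|w| flagged.iter().any(|f| *f == w.as_str()))
        .map(|w| {
            let suggestions = thesaurus.synonyms(&w);
            (w, suggestions)
        })
        .collect()
}
```

Keeping the lookup behind a trait would let the storage (embedded, compressed, or loaded from disk) change later without touching the linters.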

One concern I currently have is the large size of the thesaurus. The source text file is ~24MB, and I'm not sure how workable and/or justifiable that is, especially if the feature isn't used much. I guess it might be possible to make it an optional feature, or perhaps to compress the file in some way too. I'd be curious to see how it affects the binary size and benchmarks once I can integrate it with a linter.

@hippietrail
Collaborator

One concern I currently have is the large size of the thesaurus. The source text file is ~24MB, and I'm not sure how workable and/or justifiable that is, especially if the feature isn't used much. I guess it might be possible to make it an optional feature, or perhaps to compress the file in some way too. I'd be curious to see how it affects the binary size and benchmarks once I can integrate it with a linter.

That's pretty big. Last time I looked at the size of the Harper curated dictionary it was less than 60,000 lines... in fact it's 754,147 bytes on disk right now. At the moment the dictionary has some amount of affix compression while I'll assume the thesaurus has no compression. But since they are text files presumably with words referencing other words, it shouldn't be too hard to come up with a format where most words appear as full strings only once and the rest are references as string slices. Also every word could just have a numeric ID and thence reference each other by ID, which may or may not be smaller than a string slice.
Also the thesaurus need not live in memory like the dictionary and could live just on disk. I'm sure it would be possible to implement it once without worrying about efficiency and then improve it with new schemes over time.
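
The numeric-ID idea could be sketched roughly as follows. `Interner` and `IdThesaurus` are hypothetical names, and the lookup map duplicates each string for simplicity, so this shows the structure rather than an optimized layout:

```rust
use std::collections::HashMap;

/// Assigns each distinct word a compact numeric ID.
/// Note: the `ids` map keeps a second copy of every string; a tuned
/// version would avoid that, but it keeps the sketch short.
#[derive(Default)]
pub struct Interner {
    ids: HashMap<String, u32>,
    words: Vec<String>,
}

impl Interner {
    /// Return the existing ID for `word`, or allocate the next one.
    pub fn intern(&mut self, word: &str) -> u32 {
        if let Some(&id) = self.ids.get(word) {
            return id;
        }
        let id = self.words.len() as u32;
        self.words.push(word.to_owned());
        self.ids.insert(word.to_owned(), id);
        id
    }

    /// Map an ID back to its word. Panics on an ID that was never issued.
    pub fn resolve(&self, id: u32) -> &str {
        &self.words[id as usize]
    }
}

/// Thesaurus whose entries reference other words by ID, not by string.
#[derive(Default)]
pub struct IdThesaurus {
    interner: Interner,
    entries: HashMap<u32, Vec<u32>>,
}

impl IdThesaurus {
    pub fn insert(&mut self, root: &str, synonyms: &[&str]) {
        let root_id = self.interner.intern(root);
        let ids: Vec<u32> = synonyms.iter().map(|s| self.interner.intern(s)).collect();
        self.entries.insert(root_id, ids);
    }

    pub fn synonyms(&self, word: &str) -> Vec<&str> {
        self.interner
            .ids
            .get(word)
            .and_then(|id| self.entries.get(id))
            .map(|ids| ids.iter().map(|&i| self.interner.resolve(i)).collect())
            .unwrap_or_default()
    }

    /// Number of distinct words stored across all entries.
    pub fn word_count(&self) -> usize {
        self.interner.words.len()
    }
}
```

Each synonym reference costs four bytes here regardless of word length, so whether this beats string slices depends on the average word length and how often words repeat.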

@elijah-potter
Collaborator

I adore the idea of doing stuff with a thesaurus, but (like @hippietrail), I'm concerned about the filesize.

My initial reaction is that we could have synonym lookups be a separate feature from the linter and happen on-demand over the network?

We could also provide multiple versions of the Harper binary. One would include only the linter with a limited dataset (call it harper-ls-small) and the other would include the thesaurus (call it harper-ls-large). This would also give us space to add features like autocomplete.

I eagerly await your thoughts.

@86xsk
Contributor Author

86xsk commented Oct 22, 2025

My initial reaction is that we could have synonym lookups be a separate feature from the linter and happen on-demand over the network?

We could also provide multiple versions of the Harper binary. One would include only the linter with a limited dataset (call it harper-ls-small) and the other would include the thesaurus (call it harper-ls-large). This would also give us space to add features like autocomplete.

Those are both very interesting ideas that I haven't considered. I'll certainly have to keep them in mind.

I've done some more development on this locally, and I've found that it increases the size of harper-ls by roughly the size of the thesaurus file itself (~24MB). I'm curious to explore options for reducing the file size, perhaps by filtering out entries we're unlikely to need, or by compressing the file in some way.

In terms of shipping these changes, I guess my current development is more aligned with the second idea you mentioned, with the thesaurus being an optional feature that can be enabled for a build. (Though admittedly I haven't thought too much about that yet, as I've been more focused on just getting some decent output 😆)
[image]

Unfortunately I haven't been able to dedicate too much time to this as of late, hence progress has been somewhat slow. If there's interest, I could clean up and push the changes I have locally at this point. Though currently there's still the aforementioned quality of output issues, along with what seem to be some noticeable performance regressions (pending further testing).

@elijah-potter
Collaborator

I'd love to see what you have been cooking up. As this is a draft PR, I see no problem with your pushing up your work, even if it's unfinished.

@elijah-potter
Collaborator

Random thought: Instead of storing the full words to create the network, you could store the WordId. Since it is much smaller, it could compress the full thesaurus down by quite a bit. It would obviously have to be a compile step, however. I understand that would make things much more complex.

@86xsk
Contributor Author

86xsk commented Oct 27, 2025

Random thought: Instead of storing the full words to create the network, you could store the WordId. Since it is much smaller, it could compress the full thesaurus down by quite a bit. It would obviously have to be a compile step, however. I understand that would make things much more complex.

If my (slightly sleep-deprived) shell scripting is correct, it unfortunately doesn't look like it would provide too much space savings:
[image]

Currently I'm thinking of:

  • Compressing/serializing the file in some way. Merely compressing the file with zip brings it down to about 9MB, so I think there's some decent potential for savings there. Though whether it will be enough is another question.
  • Removing entries that we might not need. For instance, by removing entries for words that don't exist in the curated dictionary. Now that I think about it, I think this should technically be possible to do at build-time too, so we can avoid making destructive changes to the thesaurus file.
  • Rather than making the thesaurus an optional feature that's built into the binary, we could simply load a thesaurus file from disk at runtime, much like the user dictionary. However, I'm not sure how workable this would be for the non-native/web builds, since I'm not too experienced on that side of things.
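
The second option in the list above (dropping entries that aren't in the curated dictionary, without modifying the source file) might look something like this. It is only a sketch: it assumes each thesaurus line is a root word followed by comma-separated synonyms, and that the dictionary is available as a plain set of words. `parse_line` and `filter_thesaurus` are hypothetical names:

```rust
use std::collections::{HashMap, HashSet};

/// Parse one line, assumed to be `root,synonym1,synonym2,...`.
/// Returns `None` for blank lines.
pub fn parse_line(line: &str) -> Option<(&str, Vec<&str>)> {
    let mut parts = line.split(',').map(str::trim).filter(|p| !p.is_empty());
    let root = parts.next()?;
    Some((root, parts.collect()))
}

/// Keep only entries whose root word is in `dictionary`, and within those
/// entries keep only synonyms that are also in `dictionary`.
/// The source text itself is left untouched.
pub fn filter_thesaurus<'a>(
    source: &'a str,
    dictionary: &HashSet<String>,
) -> HashMap<&'a str, Vec<&'a str>> {
    source
        .lines()
        .filter_map(|line| parse_line(line))
        .filter(|(root, _)| dictionary.contains(*root))
        .map(|(root, synonyms)| {
            let kept: Vec<&str> = synonyms
                .into_iter()
                .filter(|s| dictionary.contains(*s))
                .collect();
            (root, kept)
        })
        .collect()
}
```

Something along these lines could run in a build script, so the pruned data is what gets embedded while the original thesaurus file stays as-is in the repository.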

(I still haven't done too much research into this to be honest, though it is my next step. I've been experimenting with trying to inflect the synonyms based on the source word, but I'm starting to feel like that might be better left for a future PR, in the interest of keeping this one reviewable.)

@elijah-potter
Collaborator

Removing entries that we might not need. For instance, by removing entries for words that don't exist in the curated dictionary. Now that I think about it, I think this should technically be possible to do at build-time too, so we can avoid making destructive changes to the thesaurus file.

I think this is a great first step. We can also remove any entries we don't consider "common" according to the metadata.
