feat: create harper-thesaurus
#2085
|
Since Harper is a linter at heart it needs to begin with flagging a run of text. Do you envisage a thesaurus for changing any word to a synonym or for flagging certain specific words and then suggesting synonyms? So far the latter seems to fit Harper but not the former. I've been toying with this myself for certain kinds of buzzwords and overused words that have become fashionable:
Some require some heuristics to decide if they're OK or should be flagged. Or to show a different The swear word replacement linter is kind of in this category too. |
That's a good point. At the moment my goal with this is to create a basic backbone for a thesaurus and to try experimenting by integrating it with a linter. I'm hoping that will provide a better idea as to how practical/useful this feature would be. Currently, I'm only envisioning it as providing a basic common thesaurus API that linters could call into to provide synonyms when they deem beneficial. (In the future, if there's interest in it, I could see One concern I currently have is the large size of the thesaurus. The source text file is ~24MB, and I'm not sure how workable and/or justifiable that is, especially if the feature isn't used much. I guess it might be possible to make it an optional feature, or perhaps to compress the file in some way too. I'd be curious to see how it affects the binary size and benchmarks once I can integrate it with a linter. |
That's pretty big. Last time I looked at the size of the Harper curated dictionary it was less than 60,000 lines... in fact it's 754,147 bytes on disk right now. At the moment the dictionary has some amount of affix compression while I'll assume the thesaurus has no compression. But since they are text files presumably with words referencing other words, it shouldn't be too hard to come up with a format where most words appear as full strings only once and the rest are references as string slices. Also every word could just have a numeric ID and thence reference each other by ID, which may or may not be smaller than a string slice. |
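A minimal sketch of the ID-based idea above, assuming a plain in-memory layout. `InternedThesaurus`, `intern`, `add_entry`, and `synonyms_of` are hypothetical names for illustration and are not part of Harper or `harper-thesaurus`. Each word is stored as a full string once, and synonym lists hold `u32` IDs rather than repeated strings:

```rust
use std::collections::HashMap;

/// Hypothetical compact thesaurus (not Harper's actual format):
/// every word is stored once, synonyms are numeric IDs.
#[derive(Default)]
struct InternedThesaurus {
    words: Vec<String>,        // ID -> word
    ids: HashMap<String, u32>, // word -> ID
    synonyms: Vec<Vec<u32>>,   // ID -> IDs of its synonyms
}

impl InternedThesaurus {
    /// Return the ID for `word`, allocating a new one on first sight.
    fn intern(&mut self, word: &str) -> u32 {
        if let Some(&id) = self.ids.get(word) {
            return id;
        }
        let id = self.words.len() as u32;
        self.words.push(word.to_string());
        self.ids.insert(word.to_string(), id);
        self.synonyms.push(Vec::new());
        id
    }

    /// Record one thesaurus line: a headword followed by its synonyms.
    fn add_entry(&mut self, head: &str, syns: &[&str]) {
        let head_id = self.intern(head);
        for s in syns {
            let id = self.intern(s);
            self.synonyms[head_id as usize].push(id);
        }
    }

    /// Resolve the stored IDs back to string slices.
    fn synonyms_of(&self, word: &str) -> Vec<&str> {
        match self.ids.get(word) {
            Some(&id) => self.synonyms[id as usize]
                .iter()
                .map(|&s| self.words[s as usize].as_str())
                .collect(),
            None => Vec::new(),
        }
    }
}

fn main() {
    let mut t = InternedThesaurus::default();
    t.add_entry("big", &["large", "huge"]);
    t.add_entry("large", &["big", "huge"]);
    // Three distinct strings are stored even though five words were listed.
    println!("{} unique words", t.words.len());
    println!("{:?}", t.synonyms_of("large"));
}
```

Whether 4-byte IDs actually beat string slices depends on average word length and on how the table is serialized, so this would need measuring against the real source file.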
Add `harper-thesaurus` as optional dependency.
Add a module that makes it easier to use the thesaurus from `harper-core`.
|
I adore the idea of doing stuff with a thesaurus, but (like @hippietrail), I'm concerned about the filesize. My initial reaction is that we could have synonym lookups be a separate feature from the linter and happen on-demand over the network? We could also provide multiple versions of the Harper binary. One would include only the linter with a limited dataset (call it I eagerly await your thoughts. |
|
I'd love to see what you have been cooking up. As this is a draft PR, I see no problem with your pushing up your work, even if it's unfinished. |
Sort by `TokenKind` similarity in addition to sorting by word frequency.
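A rough sketch of what a two-key ordering like this could look like. The `Kind` enum below is a simplified stand-in, not Harper's real `TokenKind`, and `Candidate`, `rank`, and the frequency numbers are invented for illustration. "Similarity" is reduced here to an exact match on kind, which is an assumption; the actual commit may compare kinds differently:

```rust
/// Simplified stand-in for a token kind (assumption, not Harper's `TokenKind`).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Kind {
    Noun,
    Verb,
    Adjective,
}

/// Hypothetical synonym candidate with a made-up frequency score.
struct Candidate {
    word: &'static str,
    kind: Kind,
    frequency: u32,
}

/// Put candidates whose kind matches `target` first, then order by
/// descending frequency within each group.
fn rank(target: Kind, mut candidates: Vec<Candidate>) -> Vec<&'static str> {
    candidates.sort_by(|a, b| {
        let a_match = a.kind == target;
        let b_match = b.kind == target;
        // `true` sorts after `false`, so compare b to a to get matches first.
        b_match
            .cmp(&a_match)
            .then(b.frequency.cmp(&a.frequency))
    });
    candidates.into_iter().map(|c| c.word).collect()
}

fn main() {
    let candidates = vec![
        Candidate { word: "giant", kind: Kind::Noun, frequency: 900 },
        Candidate { word: "large", kind: Kind::Adjective, frequency: 500 },
        Candidate { word: "huge", kind: Kind::Adjective, frequency: 800 },
        Candidate { word: "enlarge", kind: Kind::Verb, frequency: 100 },
    ];
    println!("{:?}", rank(Kind::Adjective, candidates));
}
```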
|
Random thought: Instead of storing the full words to create the network, you could store the |
I think this is a great first step. We can also remove any entries we don't consider "common" according to the metadata. |

Description
Creates `harper-thesaurus`, laying the initial groundwork for providing synonyms in Harper.
This uses Moby Thesaurus II, a public domain thesaurus.
TODO
`BoringWords` linter, providing suggestions for alternative words.
How Has This Been Tested?
`cargo test`
Checklist