Replace passivepy with a call to an LLM #147
nonprofittechy merged 10 commits into migrate-from-spaCy-and-nltk
Conversation
Pull Request Overview
This PR replaces PassivePy (a Python library for passive voice detection) with a call to OpenAI's LLM (gpt-5-nano) for passive voice detection in text analysis, moving from a local library to an AI-powered cloud solution.
- Removes dependency on PassivePy and tools.suffolklitlab.org API for passive voice detection
- Implements new LLM-based passive voice detection using OpenAI's gpt-5-nano model
- Replaces NLTK sentence tokenization with a regex-based approach to reduce dependencies
Reviewed Changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| formfyxer/passive_voice_detection.py | New module implementing LLM-based passive voice detection with OpenAI API |
| formfyxer/lit_explorer.py | Updated to use new passive voice detection module instead of tools API |
| formfyxer/tests/test_passive_voice_detection.py | Comprehensive unit tests for the new passive voice detection functionality |
| formfyxer/prompts/passive_voice.txt | Prompt template for LLM passive voice classification |
| promptfooconfig.yaml | Configuration for evaluating the LLM passive voice detector |
| test_passive_voice_detection.py | Integration test script for the passive voice detection module |
| formfyxer/tests/passive_voice_test_dataset.csv | Test dataset for passive voice evaluation |
@BryceStevenWilley this turned out to take much more testing than I expected--I thought it would be the easier drop-in replacement, lol. But I'll do a future PR off of this branch, since we already replace sentence tokenization in this PR. The CI failures are the same errors with the old ML dependencies; going to ignore those for now.
BryceStevenWilley left a comment:
LGTM! My only nit is that we should rename + move the integration test file.
Co-authored-by: Bryce Willey <bryce.willey@suffolk.edu>
Merged commit c08bfd4 into migrate-from-spaCy-and-nltk
This replaces the sentence tokenization we used in a few places with a regular expression (instead of NLTK) and replaces the use of PassivePy (via tools.suffolklitlab.org) with a call to an LLM.
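For illustration, the regex approach to sentence splitting can be sketched like this (a minimal example of the idea, not the exact pattern used in formfyxer):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive regex sentence splitter: break after ., !, or ?
    followed by whitespace. A sketch of the approach, not
    formfyxer's actual implementation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("The form was signed. Please file it today!"))
# → ['The form was signed.', 'Please file it today!']
```

A pattern like this mis-splits on abbreviations ("Mr. Smith"), which is the usual tradeoff for dropping the NLTK dependency.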
PassivePy reports 98% accuracy on its test dataset; the gpt-5-nano LLM scores 95.65% on the same dataset of about 1,100 sentences in a promptfoo evaluation. I spent a lot of time on multiple rounds of tests and tweaks, comparing few-shot prompts with extremely detailed instructions against zero-shot classification, and something closer to zero-shot with fewer rules in the prompt performs best for gpt-5-nano. When I looked closely at the failures, they mostly stem from ambiguous sentences that have a valid passive-voice interpretation but were marked as active by PassivePy's human annotators. I feel confident that the LLM's current performance is good enough to capture confusing sentences, as the sentences our prompt marked "passive" but the humans marked "active" confused me too!
Some of the "weird" sentences where we disagreed with human annotators:
Some show adjective-vs-verb confusion--after looking closely I agree with the human annotators, but the errors are on weird or ungrammatical sentences, close calls with two valid readings (one passive and one active), or sentences with genuinely ambiguous usage.
Note that gpt-5-nano is extremely inexpensive, and our prompt caches well. Testing 1,100 sentences cost 12.5 cents.
If this lets us power off tools.suffolklitlab.org, that would be a significant savings, as this is likely to cost less than a dollar a month even at quite high usage.
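The cost claim follows from simple arithmetic on the numbers above:

```python
# Back-of-the-envelope check of the cost numbers above.
batch_cost_dollars = 0.125   # 12.5 cents for the promptfoo run
batch_sentences = 1100       # ~1,100 sentences in the test dataset

per_sentence = batch_cost_dollars / batch_sentences
print(f"${per_sentence:.6f} per sentence")                 # ≈ $0.000114
print(f"{1.00 / per_sentence:,.0f} sentences per dollar")  # = 8,800
```

So a dollar a month covers roughly 8,800 classified sentences, before even counting prompt-caching discounts.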
Additionally, I explored the new Responses API extensively but ultimately stuck with the tried-and-true ChatCompletion API: Responses cannot be tested in the current version of promptfoo, and its performance seemed worse than ChatCompletion's (though again, that's hard to verify with promptfoo; any gains would be a slight cost reduction, fractions of a penny per thousand uses).
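For reference, the ChatCompletion approach looks roughly like this. This is a sketch: the model name comes from this PR, but the prompt text and function names here are hypothetical stand-ins, not the contents of formfyxer/prompts/passive_voice.txt or the module's real API.

```python
# Hypothetical stand-in for the real prompt in formfyxer/prompts/passive_voice.txt.
PROMPT = (
    "Decide whether the sentence below is in the passive voice. "
    "Answer with exactly one word: passive or active.\n\n"
    "Sentence: {sentence}"
)

def parse_label(raw: str) -> bool:
    """Interpret the model's one-word reply as a boolean."""
    return raw.strip().lower() == "passive"

def is_passive(sentence: str) -> bool:
    # Deferred import so the module loads even without the OpenAI SDK installed.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content": PROMPT.format(sentence=sentence)}],
    )
    return parse_label(response.choices[0].message.content)
```

Keeping the prompt prefix identical across calls is what lets provider-side prompt caching kick in, which is part of why the per-sentence cost stays so low.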
Progress toward #145