-
-
Notifications
You must be signed in to change notification settings - Fork 49
Refactor Document model to simplify public-facing API #278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…to use common Document object.
The improved code seems good to me, but I haven't really used eyecite in awhile, so I trust the case law team to think about it more carefully. If we do make this change, let's be sure to bump the version properly to indicate the breaking change, and maybe make some good notes in the release notes so people can easily upgrade. |
@mattdahl I need to look at this closer but I think there is some difference between how I would expect a user to generate linked_text
we store a copy of the document so we dont have to generate or regenerate it for making annotations. Although I can imagine annotating looks cleaner here. but wonder if that makes a difference for you |
Ah I see, I missed that. I think that pattern has two things going on:
|
I think you are right that it is hidden but I think if I was refactoring I would want this.
I could argue that we just pass in your markup or your plaintext with the bool -is_markup =True/false and let it handle the the extra cleanup steps. But I havent looked at the code in a minute so im not sure if there are any downsides to this pattern. |
In #240, @flooie introduced a new
Document
class to better keep track of any markup and cleaning steps applied to the text given to eyecite for parsing. However, the class is currently only used internally, meaning that it is still pretty confusing for the user (in my opinion) to keep track of what parameters they should be passing when.For example,
get_citations()
currently accepts both aplain_text
and amarkup_text
parameter, but both are typed optional. So can the user choose which of these to pass, or is the user always supposed to passplain_text
no matter what? Ifmarkup_text
exists, should it be passed in addition toplain_text
, or instead of it? Shouldplain_text
still be cleaned by the user in advance? But isn't theDocument
abstraction supposed to keep track of the cleaning steps now? Or is that cleaning stage only cleaning formarkup_text
, and if there's other cleaning to be done, the user should still handle that in advance? And when the user later callsannotate_citations()
, they're supposed to remember what the originalplain_text
was (along with thesource_text
, an entirely new variable?), and pass some combination of it all over again?This PR attempts to resolve these confusions by simply exposing the internal
Document
model to the user. The idea is that the user should always instantiate aDocument
himself, and then just pass that object toget_citations()
andannotate_citations()
. All the pre-processing related parameters (whether the text has markup, any cleaning steps, what diffing algorithm to use, etc.) are dealt with only once at theDocument
level and then never needed again (from the user's perspective).So with this PR, for example, the user would do:
Instead of (current syntax):
This PR is purely refactoring the ergonomics of the API -- it doesn't change any functionality. (The changes to the tests are just syntactical.) I think it's an improvement, but I recognize that breaking the API is always controversial, so open to discussion. And I think the idea to introduce the
Document
abstraction in the first place is great.Some other documentation improvement-related PRs to follow soon as well.