Refactor Document model to simplify public-facing API #278

Open
wants to merge 4 commits into
base: main
from

Conversation

mattdahl
Contributor

In #240, @flooie introduced a new Document class to better keep track of any markup and cleaning steps applied to the text given to eyecite for parsing. However, the class is currently only used internally, meaning that it is still pretty confusing for the user (in my opinion) to keep track of what parameters they should be passing when.

For example, get_citations() currently accepts both a plain_text and a markup_text parameter, but both are typed optional. So can the user choose which of these to pass, or is the user always supposed to pass plain_text no matter what? If markup_text exists, should it be passed in addition to plain_text, or instead of it? Should plain_text still be cleaned by the user in advance? But isn't the Document abstraction supposed to keep track of the cleaning steps now? Or is that cleaning stage only cleaning for markup_text, and if there's other cleaning to be done, the user should still handle that in advance? And when the user later calls annotate_citations(), they're supposed to remember what the original plain_text was (along with the source_text, an entirely new variable?), and pass some combination of it all over again?

This PR attempts to resolve these confusions by simply exposing the internal Document model to the user. The idea is that the user should always instantiate a Document himself, and then just pass that object to get_citations() and annotate_citations(). All the pre-processing related parameters (whether the text has markup, any cleaning steps, what diffing algorithm to use, etc.) are dealt with only once at the Document level and then never needed again (from the user's perspective).

So with this PR, for example, the user would do:

from eyecite import get_citations, annotate_citations, Document

document = Document(
    opinion_text,
    has_markup=True,
    clean_steps=["html", "all_whitespace"]
)
citations = get_citations(document)
linked_text = annotate_citations(
    document,
    annotations=[[c.span(), "<a>", "</a>"] for c in citations],
)

Instead of (current syntax):

from eyecite import get_citations, annotate_citations, clean_text

markup_text = opinion_text
plain_text = clean_text(markup_text, steps=["html", "all_whitespace"])  # I think?

citations = get_citations(
    plain_text=plain_text,
    markup_text=markup_text,
    clean_steps=["html", "all_whitespace"]
)
linked_text = annotate_citations(
    plain_text=plain_text,
    source_text=markup_text,
    annotations=[[c.span(), "<a>", "</a>"] for c in citations],
)

This PR is purely refactoring the ergonomics of the API -- it doesn't change any functionality. (The changes to the tests are just syntactical.) I think it's an improvement, but I recognize that breaking the API is always controversial, so open to discussion. And I think the idea to introduce the Document abstraction in the first place is great.
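
To make the "configure once, pass everywhere" idea concrete, here is a minimal toy sketch that does not use eyecite at all. `SketchDocument`, `CLEANERS`, and `find_years` are hypothetical names invented for illustration, and the two cleaners are simple regex stand-ins rather than eyecite's real cleaning steps. The point is only that the pre-processing options are stored on one object, and downstream functions need nothing but that object.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Toy cleaners keyed by step name. These are stand-ins for illustration,
# not eyecite's actual cleaning functions.
CLEANERS: Dict[str, Callable[[str], str]] = {
    "html": lambda s: re.sub(r"<[^>]+>", "", s),
    "all_whitespace": lambda s: re.sub(r"\s+", " ", s).strip(),
}


@dataclass
class SketchDocument:
    """Hypothetical document object: holds the source and how it was cleaned."""

    source_text: str
    has_markup: bool = False
    clean_steps: List[str] = field(default_factory=list)
    plain_text: str = field(init=False)

    def __post_init__(self) -> None:
        # Apply every cleaning step exactly once, at construction time.
        text = self.source_text
        for step in self.clean_steps:
            text = CLEANERS[step](text)
        self.plain_text = text


def find_years(document: SketchDocument) -> List[str]:
    """Hypothetical consumer: it reads the document and takes no text arguments."""
    return re.findall(r"\b(?:1[89]|20)\d{2}\b", document.plain_text)


doc = SketchDocument(
    "<p>Decided   in <b>1954</b>.</p>",
    has_markup=True,
    clean_steps=["html", "all_whitespace"],
)
print(doc.plain_text)
print(find_years(doc))
```

Because the object keeps both `source_text` and `plain_text`, a later annotation step could read whichever it needs without the caller passing either one again.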

Some other documentation improvement-related PRs to follow soon as well.

@mlissner
Member

The improved code seems good to me, but I haven't really used eyecite in a while, so I trust the case law team to think about it more carefully. If we do make this change, let's be sure to bump the version properly to indicate the breaking change, and maybe make some good notes in the release notes so people can easily upgrade.

@mlissner mlissner requested a review from flooie June 16, 2025 05:12
@mlissner mlissner moved this to To Do in Case Law Sprint Jun 16, 2025
@flooie flooie moved this from To Do to June 15 in Case Law Sprint Jun 16, 2025
@flooie flooie moved this from June 15 to Late June in Case Law Sprint Jun 30, 2025
@flooie
Contributor

flooie commented Jun 30, 2025

@mattdahl I need to look at this more closely, but I think there is some difference between this and how I would expect a user to generate linked_text:

markup_text = "<html goes here...>"
citations = get_citations(
    markup_text=markup_text, clean_steps=["html", "all_whitespace"]
)
document = citations[0].document
linked_text = annotate_citations(
    plain_text=document.plain_text,
    annotations=[(c.span(), "<a>", "</a>") for c in citations],
    source_text=document.markup_text,
    offset_updater=document.plain_to_markup,
)

We store a copy of the document so we don't have to generate or regenerate it for making annotations. I can imagine annotating looks cleaner here, though, so I wonder if that makes a difference for you.

@flooie flooie moved this from Late June to To Do in Case Law Sprint Jun 30, 2025
@mattdahl
Contributor Author

mattdahl commented Jul 1, 2025

Ah I see, I missed that. I think that pattern has two things going on:

  1. Creating the document. Currently this happens silently as a byproduct of calling get_citations() -- I think it's better for the user to consciously see that happening by instantiating the object himself, so he knows it exists for later.
  2. Accessing the document. Currently this is done by grabbing it from citations[0].document -- following from (1), this feels a bit hidden to me. The relationship also feels reversed to me -- if anything, a document should have citations, not the other way around? Here, every citation will have a pointer to the same document, right?
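
The two points above can be sketched with toy classes. Nothing here is eyecite's API: `SketchDocument`, `SketchCitation`, and `add_citation` are hypothetical names, and the citations are located with a plain substring search rather than a real parser. The sketch only shows the "document has citations" direction, where the caller holds the document and reaches the citations through it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class SketchCitation:
    """Hypothetical citation: just the matched text and its span."""

    text: str
    span: Tuple[int, int]


@dataclass
class SketchDocument:
    """Hypothetical document that owns its citations."""

    plain_text: str
    citations: List[SketchCitation] = field(default_factory=list)

    def add_citation(self, cited: str) -> SketchCitation:
        # Locate the cited string in the text; raises ValueError if absent.
        start = self.plain_text.index(cited)
        citation = SketchCitation(cited, (start, start + len(cited)))
        self.citations.append(citation)
        return citation


doc = SketchDocument("See 347 U.S. 483 and 410 U.S. 113.")
doc.add_citation("347 U.S. 483")
doc.add_citation("410 U.S. 113")

# The caller never needs citations[0].document: it already holds `doc`.
for c in doc.citations:
    start, end = c.span
    print(c.text, doc.plain_text[start:end] == c.text)
```

A design with back-pointers from each citation to its document can coexist with this, but the explicit object means the user does not have to discover the document through a citation.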

@flooie
Contributor

flooie commented Jul 2, 2025

@mattdahl

I think you are right that it is hidden, but if I were refactoring, I would want this:

from eyecite import Document

doc = Document(
    markup_text=opinion_text,
    clean_steps=["html", "all_whitespace"],
)

# or plain text:
# doc = Document(
#     plain_text=opinion_text,
#     clean_steps=["all_whitespace"],
# )

# pull out all the Citation objects
citations = doc.get_citations()

# or perhaps it's just doc.citations, since the citations would be generated automatically

# build your <a> spans and apply them
markup_annotations = doc.annotate_markup(
    [(c.span(), "<a>", "</a>") for c in citations]
)

I could argue that we just pass in your markup or your plain text with a bool (is_markup=True/False) and let it handle the extra cleanup steps. But I haven't looked at the code in a minute, so I'm not sure if there are any downsides to this pattern.
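
One way the bool-driven variant might choose its cleanup steps is sketched below. This is an assumption for discussion, not existing eyecite behavior: `resolve_clean_steps` and `DEFAULT_STEPS` are hypothetical names, and the default lists simply reuse the step names already shown in this thread.

```python
from typing import Dict, List, Optional

# Hypothetical defaults, reusing the step names from the examples above.
DEFAULT_STEPS: Dict[bool, List[str]] = {
    True: ["html", "all_whitespace"],  # text that contains markup
    False: ["all_whitespace"],         # text that is already plain
}


def resolve_clean_steps(
    is_markup: bool, clean_steps: Optional[List[str]] = None
) -> List[str]:
    """Return explicit steps if given, otherwise defaults based on is_markup."""
    if clean_steps is not None:
        return list(clean_steps)
    return list(DEFAULT_STEPS[is_markup])


print(resolve_clean_steps(is_markup=True))
print(resolve_clean_steps(is_markup=False))
print(resolve_clean_steps(is_markup=True, clean_steps=["html"]))
```

A possible downside of this pattern is that implicit defaults hide which cleaning was applied, so keeping an explicit `clean_steps` override available seems worthwhile.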

@flooie flooie moved this from To Do to Blocked in Case Law Sprint Jul 11, 2025
Labels
None yet
Projects
Status: Blocked
3 participants