Refactor Document model to simplify public-facing API #278

mattdahl · 2025-06-15T23:40:53Z

In #240, @flooie introduced a new Document class to better keep track of any markup and cleaning steps applied to the text given to eyecite for parsing. However, the class is currently only used internally, meaning that it is still pretty confusing for the user (in my opinion) to keep track of what parameters they should be passing when.

For example, get_citations() currently accepts both a plain_text and a markup_text parameter, but both are typed optional. So can the user choose which of these to pass, or is the user always supposed to pass plain_text no matter what? If markup_text exists, should it be passed in addition to plain_text, or instead of it? Should plain_text still be cleaned by the user in advance? But isn't the Document abstraction supposed to keep track of the cleaning steps now? Or is that cleaning stage only cleaning for markup_text, and if there's other cleaning to be done, the user should still handle that in advance? And when the user later calls annotate_citations(), they're supposed to remember what the original plain_text was (along with the source_text, an entirely new variable?), and pass some combination of it all over again?

This PR attempts to resolve these confusions by simply exposing the internal Document model to the user. The idea is that the user should always instantiate a Document himself, and then just pass that object to get_citations() and annotate_citations(). All the pre-processing related parameters (whether the text has markup, any cleaning steps, what diffing algorithm to use, etc.) are dealt with only once at the Document level and then never needed again (from the user's perspective).

So with this PR, for example, the user would do:

from eyecite import get_citations, annotate_citations, Document

document = Document(
    opinion_text,
    has_markup=True,
    clean_steps=["html", "all_whitespace"]
)
citations = get_citations(document)
linked_text = annotate_citations(
    document,
    annotations=[[c.span(), "<a>", "</a>"] for c in citations]
)

Instead of (current syntax):

from eyecite import get_citations, annotate_citations, clean_text

markup_text = opinion_text
plain_text = clean_text(markup_text, steps=["html", "all_whitespace"])  # I think?

citations = get_citations(
    plain_text=plain_text,
    markup_text=markup_text,
    clean_steps=["html", "all_whitespace"]
)
linked_text = annotate_citations(
    plain_text=plain_text,
    source_text=markup_text,
    annotations=[[c.span(), "<a>", "</a>"] for c in citations]
)

This PR is purely refactoring the ergonomics of the API -- it doesn't change any functionality. (The changes to the tests are just syntactical.) I think it's an improvement, but I recognize that breaking the API is always controversial, so open to discussion. And I think the idea to introduce the Document abstraction in the first place is great.
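To make the "configure once, reuse everywhere" idea concrete, here is a minimal standalone sketch. It does not use eyecite's actual Document class: `SketchDocument`, `strip_html`, `collapse_whitespace`, and `CLEANERS` are hypothetical stand-ins, and the two cleaners are simplified regex versions rather than eyecite's own "html" and "all_whitespace" steps.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical, simplified cleaners (NOT eyecite's implementations).
def strip_html(text: str) -> str:
    """Drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


CLEANERS: Dict[str, Callable[[str], str]] = {
    "html": strip_html,
    "all_whitespace": collapse_whitespace,
}


@dataclass
class SketchDocument:
    """Toy document: pre-processing options are given once, at construction."""

    source_text: str
    has_markup: bool = False
    clean_steps: List[str] = field(default_factory=list)
    plain_text: str = field(init=False)

    def __post_init__(self) -> None:
        # Apply the cleaning steps once; later calls reuse the stored result.
        text = self.source_text
        for step in self.clean_steps:
            text = CLEANERS[step](text)
        self.plain_text = text


doc = SketchDocument(
    "<p>See  410 U.S. 113.</p>",
    has_markup=True,
    clean_steps=["html", "all_whitespace"],
)
print(doc.plain_text)
```

Because the object keeps both `source_text` and `plain_text`, any later function that receives it has everything it needs without the caller repeating the cleaning parameters.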

Some other documentation improvement-related PRs to follow soon as well.

mlissner · 2025-06-16T05:12:18Z

The improved code seems good to me, but I haven't really used eyecite in a while, so I trust the case law team to think about it more carefully. If we do make this change, let's be sure to bump the version properly to indicate the breaking change, and write clear release notes so people can easily upgrade.

flooie · 2025-06-30T14:24:01Z

@mattdahl I need to look at this more closely, but I think this differs somewhat from how I would expect a user to generate linked_text:

markup_text = "<html goes here...>"
citations = get_citations(
    markup_text=markup_text, clean_steps=["html", "all_whitespace"]
)
document = citations[0].document
linked_text = annotate_citations(
    plain_text=document.plain_text,
    annotations=[(c.span(), "<a>", "</a>") for c in citations],
    source_text=document.markup_text,
    offset_updater=document.plain_to_markup,
)

We store a copy of the document so we don't have to generate or regenerate it when making annotations. I can imagine annotating looks cleaner in your version, though, and I wonder whether that makes a difference for you.

mattdahl · 2025-07-01T05:38:48Z

Ah I see, I missed that. I think that pattern has two things going on:

1. Creating the document. Currently this happens silently as a byproduct of calling get_citations() -- I think it's better for the user to consciously see that happening by instantiating the object himself, so he knows it exists for later.
2. Accessing the document. Currently this is done by grabbing it from citations[0].document -- following from (1), this feels a bit hidden to me. The relationship also feels reversed to me -- if anything, a document should have citations, not the other way around? Here, every citation will have a pointer to the same document, right?
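The two directions of that relationship can be sketched side by side. This is a toy model only: `SketchDocument`, `SketchCitation`, and `attach` are hypothetical names and say nothing about how eyecite actually wires its objects together.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SketchDocument:
    plain_text: str
    # "A document has citations": the forward relationship.
    citations: List["SketchCitation"] = field(default_factory=list)


@dataclass
class SketchCitation:
    span: Tuple[int, int]
    # "A citation points at its document": the back-pointer.
    # Excluded from repr/compare so the cycle never gets walked.
    document: SketchDocument = field(repr=False, compare=False)

    def text(self) -> str:
        start, end = self.span
        return self.document.plain_text[start:end]


def attach(
    document: SketchDocument, spans: List[Tuple[int, int]]
) -> List[SketchCitation]:
    """Create one citation per span; all of them share the same document."""
    for span in spans:
        document.citations.append(SketchCitation(span=span, document=document))
    return document.citations


doc = SketchDocument("A v. B, 1 U.S. 1; C v. D, 2 U.S. 2")
cites = attach(doc, [(8, 16), (26, 34)])
print([c.text() for c in cites])
```

In this sketch, `cites[0].document` and `doc` are the same object, so reaching the document through a citation works, but holding `doc` directly makes the ownership explicit.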

flooie · 2025-07-02T12:49:27Z

@mattdahl

I think you are right that it is hidden, but if I were refactoring, I would want this:

from eyecite import Document

doc = Document(
    markup_text=opinion_text,
    clean_steps=["html", "all_whitespace"],
)

# or plain text:
# doc = Document(
#     plain_text=opinion_text,
#     clean_steps=["all_whitespace"],
# )

# pull out all the Citation objects
citations = doc.get_citations()

# or perhaps it's just doc.citations, because the citations are generated automatically

# build your <a> spans and apply them
markup_annotations = doc.annotate_markup(
    [(c.span(), "<a>", "</a>") for c in citations]
)

I could argue that we just pass in your markup or your plain text with a bool (is_markup=True/False) and let it handle the extra cleanup steps. But I haven't looked at the code in a minute, so I'm not sure if there are any downsides to this pattern.
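One way the single-flag idea could work is for the flag to pick the default cleanup steps. The sketch below assumes that behavior; `resolve_clean_steps` and its `extra_steps` parameter are hypothetical and not part of eyecite. The defaults mirror the two examples above: "html" plus "all_whitespace" for markup, "all_whitespace" alone for plain text.

```python
from typing import List, Optional


def resolve_clean_steps(
    is_markup: bool, extra_steps: Optional[List[str]] = None
) -> List[str]:
    """Pick default cleanup steps from the flag, then add any extras.

    Hypothetical helper: the defaults are assumptions taken from the
    examples in this thread, not eyecite's behavior.
    """
    steps = ["html"] if is_markup else []
    steps.append("all_whitespace")
    for step in extra_steps or []:
        if step not in steps:  # avoid running the same step twice
            steps.append(step)
    return steps


print(resolve_clean_steps(is_markup=True))
print(resolve_clean_steps(is_markup=False, extra_steps=["underscores"]))
```

A possible downside of this design is that the caller no longer sees which steps ran unless the resolved list is also exposed on the document.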

mattdahl added 4 commits June 15, 2025 15:47

refactor(models): Refactors get_citations() and annotate_citations() to use common Document object.

306c3cf

refactor(tests): Refactors tests to reflect new Document syntax.

728433d

refactor(benchmark): Refactors benchmark script to use new Document syntax.

a777cbf

refactor(README): Updates README to reflect new Document syntax.

7f15401

mlissner requested a review from flooie June 16, 2025 05:12

mlissner added this to Case Law Sprint Jun 16, 2025

mlissner moved this to To Do in Case Law Sprint Jun 16, 2025

mlissner assigned flooie Jun 16, 2025

flooie moved this from To Do to June 15 in Case Law Sprint Jun 16, 2025

flooie moved this from June 15 to Late June in Case Law Sprint Jun 30, 2025

flooie moved this from Late June to To Do in Case Law Sprint Jun 30, 2025

flooie moved this from To Do to Blocked in Case Law Sprint Jul 11, 2025
