Incorrect plaintiff extraction in eyecite #193

Open
@quevon24

When parsing the following text with eyecite:

text = 'Lee County School Dist. No. 1 v. Gardner, 263 F.Supp. 26 (SC 1967)'

The plaintiff is incorrectly extracted as "1" instead of "Lee County School Dist. No. 1".

The issue seems to arise from the algorithm used to identify the plaintiff, which takes the two "words" immediately preceding the stop word "v.". However, the tokenizer appears to emit each space as a separate "word", so those two tokens cover only one real word plus a space. This leads to incorrect results for plaintiffs with multi-word names.

For example, in the problematic text above, the tokenized list of "words" is:

['Lee', ' ', 'County', ' ', 'School', ' ', 'Dist.', ' ', 'No.', ' ', '1', ' ', StopWordToken(data='v.', start=30, end=32, groups={'stop_word': 'v'}), ' ', 'Gardner,', ' ', CitationToken(data='263 F.Supp. 26', start=42, end=56, groups={'volume': '263', 'reporter': 'F.Supp.', 'page': '26'}, exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='F. Supp.', name='Federal Supplement', cite_type='federal', source='reporters', is_scotus=False), short_name='F. Supp.', start=datetime.datetime(1932, 1, 1, 0, 0), end=datetime.datetime(1988, 12, 31, 0, 0)),), short=False), ' ', '(SC', ' ', '1967)']

Using the current algorithm, the two tokens before v. are ['1', ' '], which is incorrect.

This logic works fine for shorter plaintiff names, such as:

text = 'Smith v. Bar, 263 F.Supp. 26 (SC 1967)'

Here, the tokenized list is:

['Smith', ' ', StopWordToken(data='v.', start=6, end=8, groups={'stop_word': 'v'}), ' ', 'Bar,', ' ', CitationToken(data='263 F.Supp. 26', start=14, end=28, groups={'volume': '263', 'reporter': 'F.Supp.', 'page': '26'}, exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='F. Supp.', name='Federal Supplement', cite_type='federal', source='reporters', is_scotus=False), short_name='F. Supp.', start=datetime.datetime(1932, 1, 1, 0, 0), end=datetime.datetime(1988, 12, 31, 0, 0)),), short=False), ' ', '(SC', ' ', '1967)']

The two tokens before v. are ['Smith', ' '], which correctly identifies the plaintiff as Smith.
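To make the comparison concrete, here is a minimal sketch that reproduces both cases without importing eyecite. The `StopWordToken` class below is a simplified stand-in written for this example (not eyecite's real class), `two_token_plaintiff` is a hypothetical helper name, and the token lists are shortened copies of the ones shown above.

```python
from dataclasses import dataclass


@dataclass
class StopWordToken:
    """Simplified stand-in for eyecite's StopWordToken (assumption, not the real class)."""
    data: str
    groups: dict

    def __str__(self) -> str:
        return self.data


def two_token_plaintiff(words):
    """Join the two tokens immediately before the first 'v.' stop word."""
    for index, word in enumerate(words):
        if isinstance(word, StopWordToken):
            if word.groups["stop_word"] == "v" and index > 0:
                return "".join(
                    str(w) for w in words[max(index - 2, 0):index]
                ).strip()
    return None


V = StopWordToken(data="v.", groups={"stop_word": "v"})

# Spaces are separate tokens, as in the token lists above.
long_name = ["Lee", " ", "County", " ", "School", " ", "Dist.", " ",
             "No.", " ", "1", " ", V, " ", "Gardner,"]
short_name = ["Smith", " ", V, " ", "Bar,"]

print(two_token_plaintiff(long_name))   # 1
print(two_token_plaintiff(short_name))  # Smith
```

Because one of the two slots is always taken by the space before "v.", the slice can only ever recover a single real word.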

The current approach:

if isinstance(word, StopWordToken):
    if word.groups["stop_word"] == "v" and index > 0:
        citation.metadata.plaintiff = "".join(
            str(w) for w in words[max(index - 2, 0) : index]
        ).strip()

is limited to selecting the last two tokens before "v.". Since one of those tokens is the space, this works only for single-word names and fails for plaintiffs with longer names like Lee County School Dist. No. 1.

I'm guessing this was set to two elements before "v." because it is common for plaintiffs to have short names.
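One possible direction, sketched here only as an illustration and not as a proposed patch: walk backwards from "v." and keep collecting plain string tokens until a non-string token (another stop word or a citation) or the start of the list is reached. The `StopWordToken` class is again a simplified stand-in, and `plaintiff_before_v` is a hypothetical helper name. Note the obvious limitation: in running prose this would also pick up ordinary words that precede the case name, so a real fix would likely need an extra boundary rule.

```python
from dataclasses import dataclass, field


@dataclass
class StopWordToken:
    """Simplified stand-in for eyecite's StopWordToken (assumption, not the real class)."""
    data: str
    groups: dict = field(default_factory=dict)

    def __str__(self) -> str:
        return self.data


def plaintiff_before_v(words, index):
    """Collect plain-string tokens before words[index], stopping at any other token type."""
    collected = []
    for w in reversed(words[:index]):
        if not isinstance(w, str):
            break  # hit a stop word or citation token: treat it as the boundary
        collected.append(w)
    return "".join(reversed(collected)).strip()


V = StopWordToken(data="v.", groups={"stop_word": "v"})
SEE = StopWordToken(data="See", groups={"stop_word": "see"})

long_name = ["Lee", " ", "County", " ", "School", " ", "Dist.", " ",
             "No.", " ", "1", " ", V, " ", "Gardner,"]
after_see = [SEE, " ", "Smith", " ", V, " ", "Bar,"]

print(plaintiff_before_v(long_name, long_name.index(V)))   # Lee County School Dist. No. 1
print(plaintiff_before_v(after_see, after_see.index(V)))   # Smith
```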
