Incorrect plaintiff extraction in eyecite

When parsing the following text with eyecite:

`text = 'Lee County School Dist. No. 1 v. Gardner,  263 F.Supp. 26 (SC 1967)'`

The plaintiff is incorrectly extracted as `1` instead of `Lee County School Dist. No. 1`.

The issue seems to arise from the algorithm used to identify the plaintiff, which takes the two tokens immediately preceding the stop word `v.`. However, the current implementation appears to count spaces as separate tokens, so those two tokens amount to one word plus a space. This leads to incorrect results for plaintiffs whose names span more than one word.

For example, in the problematic text above, the tokenized list of "words" is:

`['Lee', ' ', 'County', ' ', 'School', ' ', 'Dist.', ' ', 'No.', ' ', '1', ' ', StopWordToken(data='v.', start=30, end=32, groups={'stop_word': 'v'}), ' ', 'Gardner,', ' ', CitationToken(data='263 F.Supp. 26', start=42, end=56, groups={'volume': '263', 'reporter': 'F.Supp.', 'page': '26'}, exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='F. Supp.', name='Federal Supplement', cite_type='federal', source='reporters', is_scotus=False), short_name='F. Supp.', start=datetime.datetime(1932, 1, 1, 0, 0), end=datetime.datetime(1988, 12, 31, 0, 0)),), short=False), ' ', '(SC', ' ', '1967)']`

Using the current algorithm, the two tokens before `v.` are `['1', ' ']`, so the plaintiff is extracted as `1`, which is incorrect.
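The slice can be reproduced without eyecite by standing in plain strings for the tokens. This is only a simplified sketch: the real list holds `StopWordToken` and `CitationToken` objects, and both the `'v.'` string and the `plaintiff_from_slice` helper below are stand-ins introduced for illustration.

```python
# Simplified stand-in for the token list above: plain strings only.
# In eyecite the stop word is a StopWordToken, not the string 'v.'.
words = ['Lee', ' ', 'County', ' ', 'School', ' ', 'Dist.', ' ',
         'No.', ' ', '1', ' ', 'v.', ' ', 'Gardner,']


def plaintiff_from_slice(words, index):
    """Mimic the current logic: join the two entries before the stop word."""
    return "".join(str(w) for w in words[max(index - 2, 0):index]).strip()


index = words.index('v.')
print(index)                               # 12
print(plaintiff_from_slice(words, index))  # 1
```

Because every other entry is a single space, a window of two entries never covers more than one real word.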

This logic works fine for shorter plaintiff names, such as:

`text = 'Smith v. Bar, 263 F.Supp. 26 (SC 1967)'`

Here, the tokenized list is:

`['Smith', ' ', StopWordToken(data='v.', start=6, end=8, groups={'stop_word': 'v'}), ' ', 'Bar,', ' ', CitationToken(data='263 F.Supp. 26', start=14, end=28, groups={'volume': '263', 'reporter': 'F.Supp.', 'page': '26'}, exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='F. Supp.', name='Federal Supplement', cite_type='federal', source='reporters', is_scotus=False), short_name='F. Supp.', start=datetime.datetime(1932, 1, 1, 0, 0), end=datetime.datetime(1988, 12, 31, 0, 0)),), short=False), ' ', '(SC', ' ', '1967)']`

The two tokens before `v.` are `['Smith', ' ']`, which correctly identifies the plaintiff as `Smith`.

The current approach:

```
if isinstance(word, StopWordToken):
    if word.groups["stop_word"] == "v" and index > 0:
        citation.metadata.plaintiff = "".join(
            str(w) for w in words[max(index - 2, 0) : index]
        ).strip()
```

is limited to selecting the last two tokens before `v.`. This works for single-word names but fails for plaintiffs with longer names like `Lee County School Dist. No. 1`.

I'm guessing this was set to two elements before `v.` because it is common for plaintiffs to have short names.
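One possible direction, offered only as a rough sketch and not as eyecite's actual fix, is to walk backwards from the stop word and keep collecting entries while they look like part of a name. The `plaintiff_before` helper and its "starts with an uppercase letter or a digit" rule are assumptions made up for this example. The rule is a crude heuristic: it would truncate names containing lowercase connectors (for example "Bank of America") and could swallow a capitalized word that merely precedes the case name.

```python
def plaintiff_before(words, index):
    """Hypothetical helper: collect name-like words preceding words[index]."""
    collected = []
    for w in reversed(words[:index]):
        if not isinstance(w, str):
            break  # a token object (e.g. a citation) ends the name
        if not w or w.isspace():
            collected.append(w)  # keep whitespace so the name joins cleanly
            continue
        if not (w[0].isupper() or w[0].isdigit()):
            break  # a lowercase word such as "see" ends the name
        collected.append(w)
    return "".join(reversed(collected)).strip()


# Plain strings stand in for eyecite tokens; 'v.' replaces the StopWordToken.
words = ['Lee', ' ', 'County', ' ', 'School', ' ', 'Dist.', ' ',
         'No.', ' ', '1', ' ', 'v.', ' ', 'Gardner,']
print(plaintiff_before(words, words.index('v.')))
```

On the simplified list this yields the full `Lee County School Dist. No. 1`, and it still yields `Smith` for the single-word case.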







Incorrect plaintiff extraction in eyecite #193
