Add support to parse multiple document headers when getting the document number from PDFs #200

Open
@albertisfu

In freelawproject/courtlistener#5244 we found that some PDFs contain two headers, each with a different document number.

Image

This happens because the document is being uploaded to both the district and appellate cases, which can lead to retrieving the wrong document number.

We should tweak the get_document_number_from_pdf method to return the right document number.

Mike proposed:

One way to deal with this is to have different regexes for parsing each header, and to only use the appellate regex on appellate stuff, and vice versa. Would that work?
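A minimal sketch of that idea, assuming the court type is passed in with the request. The two header layouts, the regexes, and the names `HEADER_BY_COURT_TYPE` and `document_number_for_court` are illustrative assumptions, not the existing implementation; real headers vary by court.

```python
import re

# Assumed header shapes, for illustration only:
#   district:  "Case 1:20-cv-00123-ABC Document 45 Filed 01/02/21 Page 1 of 3"
#   appellate: "Case: 21-1234 Document: 00117890123 Page: 1 Date Filed: 03/04/2022"
DISTRICT_HEADER = re.compile(
    r"Case\s+(?P<docket>\S+)\s+Document\s+(?P<doc>\d+(?:-\d+)?)\s+Filed"
)
APPELLATE_HEADER = re.compile(
    r"Case:\s+(?P<docket>\S+)\s+Document:\s+(?P<doc>\d+)"
)

# Hypothetical mapping from the court type sent by the client to a regex.
HEADER_BY_COURT_TYPE = {
    "district": DISTRICT_HEADER,
    "appellate": APPELLATE_HEADER,
}


def document_number_for_court(text, court_type):
    """Return the document number from the header matching court_type, or None."""
    match = HEADER_BY_COURT_TYPE[court_type].search(text)
    return match.group("doc") if match else None


# A page that carries both headers, as described in this issue.
sample_text = (
    "Case: 21-1234 Document: 00117890123 Page: 1 Date Filed: 03/04/2022\n"
    "Case 1:20-cv-00123-ABC Document 45 Filed 01/02/21 Page 1 of 3\n"
)
district_number = document_number_for_court(sample_text, "district")
appellate_number = document_number_for_court(sample_text, "appellate")
```

With this split, the order of the headers on the page no longer matters, because each court type only ever looks at its own header shape.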

In this case, as part of the microservice request, should we pass the court type? Then, depending on the docket number format, we can parse one header or the other.

One question remains: can a bankruptcy case end up in an appellate court as well? If that's the case, and a bankruptcy document is uploaded to an appellate case, relying solely on the docket number format might not be reliable, since appellate and bankruptcy docket formats can be the same.

An alternative approach would be to return not just a single document_number, but a list of document_numbers and the docket numbers found in the document. Then, the client can decide which one to use by comparing it to the docket number of the case being processed.
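A rough sketch of this alternative, under the same assumed header layouts. `HeaderMatch`, `find_headers`, and `pick_document_number` are hypothetical names, and the exact string comparison stands in for whatever docket number normalization the client would really need.

```python
import re
from typing import NamedTuple


class HeaderMatch(NamedTuple):
    docket_number: str
    document_number: str


# One permissive regex covering both assumed header shapes (illustrative only).
HEADER = re.compile(
    r"Case:?\s+(?P<docket>\S+)\s+Document:?\s+(?P<doc>\d+(?:-\d+)?)"
)


def find_headers(text):
    """Service side: return every (docket number, document number) pair found."""
    return [
        HeaderMatch(m.group("docket"), m.group("doc"))
        for m in HEADER.finditer(text)
    ]


def pick_document_number(headers, case_docket_number):
    """Client side: choose the header whose docket number matches the case."""
    for header in headers:
        if header.docket_number == case_docket_number:
            return header.document_number
    return None  # no header belongs to this case


sample_text = (
    "Case: 21-1234 Document: 00117890123 Page: 1 Date Filed: 03/04/2022\n"
    "Case 1:20-cv-00123-ABC Document 45 Filed 01/02/21 Page 1 of 3\n"
)
headers = find_headers(sample_text)
appellate_number = pick_document_number(headers, "21-1234")
district_number = pick_document_number(headers, "1:20-cv-00123-ABC")
```

Because the match is made on the docket number itself rather than on its format, this approach would not depend on appellate and bankruptcy docket formats being distinguishable.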
