Add support to parse multiple document headers when getting the document number from PDFs

In https://github.com/freelawproject/courtlistener/issues/5244 we found that some PDFs contain two headers, each with a different document number.

![Image](https://github.com/user-attachments/assets/776d436c-59e2-4b0c-9564-6cda64bb96dc)

This happens because the document is being uploaded to both the district and appellate cases, which can lead to retrieving the wrong document number.

We should tweak the `get_document_number_from_pdf` method to return the right document number.

Mike proposed:

> One way to deal with this is to have different regexes for parsing each header, and to only use the appellate regex on appellate stuff, and vice versa. Would that work?

In that case, should we pass the court type as part of the microservice request? Then, depending on the docket number format, we can parse one header or the other.
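To make the court-type idea concrete, here is a rough sketch of one regex per header, selected by a `court_type` value passed in with the request. The header shapes, the `court_type` values, and the function name `document_number_for_court` are all assumptions for illustration; the real headers vary by court, and this is not the current `get_document_number_from_pdf` implementation.

```python
import re
from typing import Optional

# Assumed header shapes, for illustration only:
#   district:  "Case 1:21-cv-00123-ABC Document 45 Filed 03/02/22 Page 1 of 10"
#   appellate: "Case: 23-1234 Document: 00118012345 Page: 1 Date Filed: 05/01/2023"
DISTRICT_HEADER = re.compile(
    r"Case\s+(?P<docket>\d+:\d{2}-[a-z]{2}-\d+\S*)\s+Document\s+(?P<doc>\d+(?:-\d+)?)",
    re.IGNORECASE,
)
APPELLATE_HEADER = re.compile(
    r"Case:\s*(?P<docket>\d{2}-\d+)\s+Document:\s*(?P<doc>\d+)",
    re.IGNORECASE,
)

# Hypothetical court_type values sent by the client.
HEADER_BY_COURT_TYPE = {
    "district": DISTRICT_HEADER,
    "appellate": APPELLATE_HEADER,
}


def document_number_for_court(text: str, court_type: str) -> Optional[str]:
    """Return the document number from the header matching court_type, if any."""
    pattern = HEADER_BY_COURT_TYPE.get(court_type)
    if pattern is None:
        return None  # unknown court type: do not guess
    match = pattern.search(text)
    return match.group("doc") if match else None


sample_text = (
    "Case: 23-1234 Document: 00118012345 Page: 1 Date Filed: 05/01/2023\n"
    "Case 1:21-cv-00123-ABC Document 45 Filed 03/02/22 Page 1 of 10\n"
)
```

With the sample above, `document_number_for_court(sample_text, "district")` picks the district header and `document_number_for_court(sample_text, "appellate")` picks the appellate one. Bankruptcy is deliberately left out of the mapping, since which header format it should use is the open question here.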

One question remains: can a bankruptcy case end up in an appellate court as well? If that's the case, and a bankruptcy document is uploaded to an appellate case, relying solely on the docket number format might not be reliable, since appellate and bankruptcy docket formats can be the same.

An alternative approach would be to return not just a single `document_number`, but a list of all the document numbers found in the document, each paired with the docket number from its header. Then, the client can decide which one to use by comparing each docket number to that of the case being processed.
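A minimal sketch of this list-based alternative might look like the following. The permissive header regex, the `HeaderMatch` type, and the `find_headers` / `pick_document_number` names are hypothetical, and the docket comparison is a plain case-insensitive match; real docket numbers would likely need more normalization (for example, trailing judge initials on district dockets).

```python
import re
from typing import List, NamedTuple, Optional


class HeaderMatch(NamedTuple):
    docket_number: str
    document_number: str


# One permissive pattern covering both assumed header shapes:
#   "Case 1:21-cv-00123 Document 45 ..." and "Case: 23-1234 Document: 00118012345 ..."
HEADER = re.compile(
    r"Case:?\s+(?P<docket>[\w:-]+)\s+Document:?\s+(?P<doc>\d+(?:-\d+)?)",
    re.IGNORECASE,
)


def find_headers(text: str) -> List[HeaderMatch]:
    """Service side: return every (docket number, document number) pair found."""
    return [
        HeaderMatch(m.group("docket"), m.group("doc"))
        for m in HEADER.finditer(text)
    ]


def pick_document_number(
    headers: List[HeaderMatch], case_docket_number: str
) -> Optional[str]:
    """Client side: choose the document number whose docket matches the case."""
    wanted = case_docket_number.strip().lower()
    for header in headers:
        if header.docket_number.strip().lower() == wanted:
            return header.document_number
    return None  # no header belongs to this case


pdf_text = (
    "Case: 23-1234 Document: 00118012345 Page: 1 Date Filed: 05/01/2023\n"
    "Case 1:21-cv-00123 Document 45 Filed 03/02/22 Page 1 of 10\n"
)
headers = find_headers(pdf_text)
```

This keeps the microservice unaware of court types: it reports what it sees, and the client, which already knows the docket number of the case, resolves the ambiguity.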





