
Improvements to text extraction needed #186

Open
@flooie


The Needs OCR function needs to be improved. Currently, we do this to determine whether an OCR-eligible document should be OCR'd:

The Situation

if content.strip() == "" or pdf_has_images(path):
    return True

The content is generated from pdftotext using this code:

process = subprocess.Popen(
    ["pdftotext", "-layout", "-enc", "UTF-8", path, "-"],
    shell=False,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
)
content, err = process.communicate()
return content.decode(), err, process.returncode

Later, downstream on CL, we take the content and ask "are we sure we didn't need to OCR this?" like so:

for line in content.splitlines():
    line = line.strip()
    if line.startswith(("Case", "Appellate", "Appeal", "USCA")):
        continue
    elif line:
        # We found a line with good content. No OCR needed.
        return False

# We arrive here if no line was found containing good content.
return True

Here we look for any line that doesn't appear to be a Bates stamp, and as long as we find any text, garbled or otherwise, we say we are good to go.

Unfortunately, this leads to some seriously garbled plain text in RECAP, and potentially in our opinion DB.
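To make the failure mode concrete, here is a self-contained reproduction of the check above (the sample strings are made up for illustration). Any non-empty line that doesn't look like a Bates stamp counts as "good" content, so garbled extraction output sails right through:

```python
def needs_ocr(content: str) -> bool:
    """The current heuristic: True only if no 'good' line is found."""
    for line in content.splitlines():
        line = line.strip()
        if line.startswith(("Case", "Appellate", "Appeal", "USCA")):
            continue  # skip lines that look like Bates stamps / court headers
        elif line:
            return False  # any other non-empty line counts as good content
    # We arrive here if no line was found containing good content.
    return True

# Garbled extraction junk still counts as "good" content, so no OCR happens:
print(needs_ocr("Case 1:20-cv-01234\n(cid:72)(cid:101) gibberish"))  # → False
# Only a fully empty / all-Bates document triggers OCR:
print(needs_ocr("Case 1:20-cv-01234\n\n"))  # → True
```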

Examples

I don't want to rag on pdftotext; it has done an admirable job for the most part, but I do not think it is the best way to approach what we are dealing with now. For one, we are attempting to extract content and place it into a plain-text DB field. This is challenging because a good number of documents contain PDF annotation objects such as /Widget, /FreeText, /Stamp, and /Popup. That is not an exhaustive list; we also see links and signatures, and I'm sure more types.
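As a sketch of how those objects could be flagged (the subtype set is illustrative, drawn from the list above, and the helper is hypothetical): each page's /Annots array carries a /Subtype per annotation, reachable as e.g. `str(annot.Subtype)` in pikepdf or via pdfplumber's `page.annots`.

```python
# Annotation subtypes that tend to pollute plain-text extraction.
# Illustrative, not exhaustive, per the list above.
NOISY_SUBTYPES = {"/Widget", "/FreeText", "/Stamp", "/Popup"}

def has_noisy_annots(subtypes: list[str]) -> bool:
    """True if any of a page's annotation subtypes are likely to add junk.

    The subtype strings would come from the page's /Annots array.
    """
    return any(s in NOISY_SUBTYPES for s in subtypes)

print(has_noisy_annots(["/Link", "/FreeText"]))  # → True
print(has_noisy_annots(["/Link"]))               # → False
```

A page that trips this check could get stricter treatment (OCR, or annotation-aware extraction) instead of the current all-or-nothing decision.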

In addition to the complexity of handling documents that contain PDF stream objects, we also have to deal with images inserted into PDFs, or even worse, the first or maybe just the last page being a rasterized page while the middle 30-odd pages are vector PDFs.

In this case, our checks fail and have no way to catch it, because once we iterate past the Bates stamp on page 2 we get good text. See: gov.uscourts.nysd.411264.100.0.pdf
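A per-page check could catch files like this. Below is a minimal sketch of the idea (function names and thresholds are mine, not existing code): judge each page on its own extractable text and image coverage, numbers that pdfplumber exposes via `page.extract_text()`, `page.images`, `page.width`, and `page.height`.

```python
def page_needs_ocr(text: str, image_area: float, page_area: float,
                   min_chars: int = 25, min_image_coverage: float = 0.8) -> bool:
    """Flag a page that has almost no text or is mostly covered by images.

    image_area would be the summed width * height of the page's images;
    page_area is page.width * page.height. Thresholds are guesses to tune.
    """
    has_text = len(text.strip()) >= min_chars
    mostly_image = page_area > 0 and (image_area / page_area) >= min_image_coverage
    return (not has_text) or mostly_image

# A rasterized first page (no text, full-page image) is flagged ...
print(page_needs_ocr("", image_area=612 * 792, page_area=612 * 792))  # → True
# ... while a normal vector page later in the same file is not.
print(page_needs_ocr("ORDER GRANTING MOTION TO DISMISS " * 3,
                     image_area=0.0, page_area=612 * 792))  # → False
```

With a verdict per page, a mixed raster/vector document can have only its rasterized pages OCR'd instead of slipping through entirely.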

This also fails when, for example, a free-text widget is added onto an image PDF page and crosses out or adds content.

Here is an example of a non-image PDF page containing a FreeText widget (a widget, I think; it could be something different) meant to cross out the PROPOSED part.

[Screenshot: a FreeText widget crossing out "PROPOSED"]

This is not the perfect example, because the underlying content appears to contain text, but that text is corrupted and looks like this:

[Screenshot: corrupted extracted text]

In fact, see williams-v-t-mobile, a side-by-side comparison of Williams v. T-Mobile.

Note that "proposed" is incorrectly added to the text here, frustrating the adjustment made by the court, which is noted in the document itself.

[Screenshots: side-by-side of the document and its extracted text]

Angled, Circular, and Sideways Text

Not to be outdone, many judges (👋 CAND) like to use stamps with circular text. These stamps are often at the end of the document, but not exclusively. In doing so, the courts introduce gibberish into our documents when we extract the text or OCR them.

For example, gov.uscourts.cand.16711.1203.0.pdf and another file have them adjacent to the text. One is stamped into an image PDF; the other is in a regular PDF, where it garbles the extraction.

[Screenshots: circular stamp text adjacent to document text]

In both cases, the generated content makes the OCR test fail to identify a needed OCR.

Sideways Text

We also run into a problem where pdftotext does an amazing job of figuring out the sideways text and writing it into the output. This is just a fancy thing some courts, and some firms, like to do.

[Screenshot: sideways text running down the page margin]

But look at the result: it unnaturally widens the plain text and certainly frustrates plain-text searches.

[Screenshot: extraction output stretched wide by the sideways text]

This happens here and in other cases; see below.
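One possible mitigation: pdfplumber reports, for every extracted character, whether it is "upright" (drawn horizontally). A sketch of dropping rotated characters before reassembling text, so sideways rulings and circular stamp text never reach the DB; the dicts below mimic entries from pdfplumber's `page.chars`, and on a real page the same idea could be `page.filter(lambda obj: obj.get("upright", True))`:

```python
def drop_rotated(chars: list[dict]) -> str:
    """Keep only upright characters, joined in reading order."""
    return "".join(c["text"] for c in chars if c.get("upright", True))

chars = [
    {"text": "O", "upright": True},
    {"text": "K", "upright": True},
    {"text": "X", "upright": False},  # e.g. part of a sideways court stamp
]
print(drop_rotated(chars))  # → "OK"
```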

Margin Text

Occasionally, margin text in a small font causes some weird creations in the extracted text. Again, this produces extra-wide text that is hard to view and display, and which I think makes it hard to query or search for the content you are looking for.

[Screenshots: small-font margin text and the resulting extra-wide extraction]

Final complaint (Bates Stamps)

Bates stamps on every page are ingested into the content and don't reflect the document as it was generated. I would not expect to see Bates stamps or sidebar content in a published book, so I don't think we should display them in the plain text.
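One way to keep Bates stamps and margin text out of the extraction (a sketch; the 5% margin is a guess, not a tuned value) is to ignore anything positioned in the page's outer margins. With pdfplumber this could be `page.crop((x0, top, x1, bottom))` before `extract_text()`; the pure filter below shows the same idea on character coordinates:

```python
def in_body(char: dict, page_width: float, page_height: float,
            margin: float = 0.05) -> bool:
    """True if the character sits inside the page body, not the margins.

    char mimics a pdfplumber char dict, with "x0" (left edge) and
    "top" (distance from the top of the page).
    """
    left, right = page_width * margin, page_width * (1 - margin)
    top, bottom = page_height * margin, page_height * (1 - margin)
    return left <= char["x0"] <= right and top <= char["top"] <= bottom

# A character at the very bottom edge of a US-letter page (a Bates stamp):
print(in_body({"x0": 300, "top": 780}, 612, 792))  # → False
# Body text is kept:
print(in_body({"x0": 100, "top": 400}, 612, 792))  # → True
```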

What should we do?

If you've read this far, @mlissner, I know you must be dying to hear what I think the solution is.

We should drop (I think) pdftotext for, you guessed it, pdfplumber.

pdfplumber can better sample the PDFs to determine whether an entire page is likely an image, while correctly recognizing that lines or signatures are in the document and leaving them be. Additionally, we can easily extract the pure text of the document while avoiding the pitfalls described above.

We should also drop the check in CL and make all of these assessments here in doctor.

Solutions coming in the next post.
