
Improvements to text extraction needed #186

Open
@flooie


The Needs OCR function needs to be improved. Currently, we do this to determine whether an OCR-eligible document should be OCR'd:

The Situation

if content.strip() == "" or pdf_has_images(path):
    return True

The content is generated from pdftotext using this code:

process = subprocess.Popen(
    ["pdftotext", "-layout", "-enc", "UTF-8", path, "-"],
    shell=False,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
)
content, err = process.communicate()
return content.decode(), err, process.returncode

Later, downstream on CL, we take the content and ask "are we sure we didn't need to OCR this?" like so:

for line in content.splitlines():
    line = line.strip()
    if line.startswith(("Case", "Appellate", "Appeal", "USCA")):
        continue
    elif line:
        # We found a line with good content. No OCR needed.
        return False

# We arrive here if no line was found containing good content.
return True

Here we look for any line that doesn't appear to be a Bates stamp, and as long as we find any text, garbled or otherwise, we say we are good to go.

Unfortunately, this leads to some seriously garbled plain text in RECAP, and potentially in our opinion DB.
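To make the failure mode concrete, here is a self-contained reproduction of the check above (the sample strings are made up for illustration). Any non-empty line that doesn't look like a Bates stamp counts as "good" content, so garbled extraction output sails right through:

```python
def needs_ocr(content: str) -> bool:
    """The current heuristic: True only if no 'good' line is found."""
    for line in content.splitlines():
        line = line.strip()
        if line.startswith(("Case", "Appellate", "Appeal", "USCA")):
            continue  # skip lines that look like Bates stamps / court headers
        elif line:
            return False  # any other non-empty line counts as good content
    # We arrive here if no line was found containing good content.
    return True

# Garbled extraction junk still counts as "good" content, so no OCR happens:
print(needs_ocr("Case 1:20-cv-01234\n(cid:72)(cid:101) gibberish"))  # → False
# Only a fully empty / all-Bates document triggers OCR:
print(needs_ocr("Case 1:20-cv-01234\n\n"))  # → True
```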

Examples

I don't want to rag on pdftotext; it has done an admirable job for the most part, but I do not think it is the best way to approach what we are dealing with now. For one, we are attempting to extract content and place it into a plain-text DB field. This is challenging because a good number of documents contain PDF annotation objects such as /Widget, /FreeText, /Stamp, and /Popup. That is not an exhaustive list; we also see links and signatures, and I'm sure more types.
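As a sketch of how those objects could be flagged (the subtype set is illustrative, drawn from the list above, and the helper is hypothetical): each page's /Annots array carries a /Subtype per annotation, reachable as e.g. `str(annot.Subtype)` in pikepdf or via pdfplumber's `page.annots`.

```python
# Annotation subtypes that tend to pollute plain-text extraction.
# Illustrative, not exhaustive, per the list above.
NOISY_SUBTYPES = {"/Widget", "/FreeText", "/Stamp", "/Popup"}

def has_noisy_annots(subtypes: list[str]) -> bool:
    """True if any of a page's annotation subtypes are likely to add junk.

    The subtype strings would come from the page's /Annots array.
    """
    return any(s in NOISY_SUBTYPES for s in subtypes)

print(has_noisy_annots(["/Link", "/FreeText"]))  # → True
print(has_noisy_annots(["/Link"]))               # → False
```

A page that trips this check could get stricter treatment (OCR, or annotation-aware extraction) instead of the current all-or-nothing decision.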

In addition to the complexity of handling documents that contain PDF stream objects, we also have to deal with images inserted into PDFs, or even worse, the first or maybe just the last page being a rasterized page while the middle 30-odd pages are vector PDFs.

In this case, our checks fail and have no way to catch it, because once we iterate past the Bates stamp on page 2 we get good text. See: gov.uscourts.nysd.411264.100.0.pdf
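A per-page check could catch files like this. Below is a minimal sketch of the idea (function names and thresholds are mine, not existing code): judge each page on its own extractable text and image coverage, numbers that pdfplumber exposes via `page.extract_text()`, `page.images`, `page.width`, and `page.height`.

```python
def page_needs_ocr(text: str, image_area: float, page_area: float,
                   min_chars: int = 25, min_image_coverage: float = 0.8) -> bool:
    """Flag a page that has almost no text or is mostly covered by images.

    image_area would be the summed width * height of the page's images;
    page_area is page.width * page.height. Thresholds are guesses to tune.
    """
    has_text = len(text.strip()) >= min_chars
    mostly_image = page_area > 0 and (image_area / page_area) >= min_image_coverage
    return (not has_text) or mostly_image

# A rasterized first page (no text, full-page image) is flagged ...
print(page_needs_ocr("", image_area=612 * 792, page_area=612 * 792))  # → True
# ... while a normal vector page later in the same file is not.
print(page_needs_ocr("ORDER GRANTING MOTION TO DISMISS " * 3,
                     image_area=0.0, page_area=612 * 792))  # → False
```

With a verdict per page, a mixed raster/vector document can have only its rasterized pages OCR'd instead of slipping through entirely.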

This also fails when, for example, a free-text widget is added onto an image PDF page and crosses out or adds content.

Here is an example of a non-image PDF page containing a FreeText widget (a widget, I think; it could be something different) meant to cross out the PROPOSED part.

[Screenshot: a FreeText widget crossing out "PROPOSED"]

This is not the perfect example, because the underlying content appears to contain text, but that text is corrupted and looks like this:

[Screenshot: corrupted extracted text]

In fact, see williams-v-t-mobile, a side-by-side comparison of Williams v. T-Mobile.

Note that "proposed" is incorrectly added to the text here, frustrating the adjustment made by the court, which is noted in the document itself.

[Screenshots: side-by-side of the document and its extracted text]

Angled, Circular, and Sideways Text

Not to be outdone, many judges (👋 CAND) like to use stamps with circular text. These stamps are often at the end of the document, but not exclusively. In doing so, the courts introduce gibberish into our documents when we extract the text or OCR them.

For example, gov.uscourts.cand.16711.1203.0.pdf and another file have them adjacent to the text. One is stamped into an image PDF; the other is in a regular PDF, where it garbles the extraction.

[Screenshots: circular stamp text adjacent to document text]

In both cases, the generated content makes the OCR test fail to identify a needed OCR.

Sideways Text

We also run into a problem where pdftotext does an amazing job of figuring out the sideways text and writing it into the output. This is just a fancy thing some courts, and some firms, like to do.

[Screenshot: sideways text running down the page margin]

But look at the result: it unnaturally widens the plain text and certainly frustrates plain-text searches.

[Screenshot: extraction output stretched wide by the sideways text]

This happens here and in other cases; see below.
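One possible mitigation: pdfplumber reports, for every extracted character, whether it is "upright" (drawn horizontally). A sketch of dropping rotated characters before reassembling text, so sideways rulings and circular stamp text never reach the DB; the dicts below mimic entries from pdfplumber's `page.chars`, and on a real page the same idea could be `page.filter(lambda obj: obj.get("upright", True))`:

```python
def drop_rotated(chars: list[dict]) -> str:
    """Keep only upright characters, joined in reading order."""
    return "".join(c["text"] for c in chars if c.get("upright", True))

chars = [
    {"text": "O", "upright": True},
    {"text": "K", "upright": True},
    {"text": "X", "upright": False},  # e.g. part of a sideways court stamp
]
print(drop_rotated(chars))  # → "OK"
```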

Margin Text

Occasionally, margin text in a small font causes some weird creations in the extracted text. Again, this produces extra-wide text that is hard to view and display, and which I think makes it hard to query or search for the content you are looking for.

[Screenshots: small-font margin text and the resulting extra-wide extraction]

Final complaint (Bates Stamps)

Bates stamps on every page are ingested into the content and don't reflect the document as it was generated. I would not expect to see Bates stamps or sidebar content in a published book, so I don't think we should display them in the plain text.
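One way to keep Bates stamps and margin text out of the extraction (a sketch; the 5% margin is a guess, not a tuned value) is to ignore anything positioned in the page's outer margins. With pdfplumber this could be `page.crop((x0, top, x1, bottom))` before `extract_text()`; the pure filter below shows the same idea on character coordinates:

```python
def in_body(char: dict, page_width: float, page_height: float,
            margin: float = 0.05) -> bool:
    """True if the character sits inside the page body, not the margins.

    char mimics a pdfplumber char dict, with "x0" (left edge) and
    "top" (distance from the top of the page).
    """
    left, right = page_width * margin, page_width * (1 - margin)
    top, bottom = page_height * margin, page_height * (1 - margin)
    return left <= char["x0"] <= right and top <= char["top"] <= bottom

# A character at the very bottom edge of a US-letter page (a Bates stamp):
print(in_body({"x0": 300, "top": 780}, 612, 792))  # → False
# Body text is kept:
print(in_body({"x0": 100, "top": 400}, 612, 792))  # → True
```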

What should we do?

If you've read this far, @mlissner, I know you must be dying to hear what I think the solution is.

We should drop (I think) pdftotext for, you guessed it, pdfplumber.

pdfplumber can better sample the PDFs to determine whether an entire page is likely an image, while correctly recognizing that lines or signatures are in the document and leaving them be. Additionally, we can easily extract the pure text of the document while avoiding the pitfalls described above.

We should also drop the check in CL and make all of these assessments here in doctor.

Solutions coming in the next post.
