Description
The Needs OCR function needs to be improved. Currently, we use the following check to decide whether an OCR-eligible document should be OCRed.
The Situation
```python
if content.strip() == "" or pdf_has_images(path):
    return True
```
The content is generated from pdftotext using this code:
```python
process = subprocess.Popen(
    ["pdftotext", "-layout", "-enc", "UTF-8", path, "-"],
    shell=False,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
)
content, err = process.communicate()
return content.decode(), err, process.returncode
```
Later, downstream in CL, we take the content and ask: are we sure we didn't need to OCR this? That check looks like this:
```python
for line in content.splitlines():
    line = line.strip()
    if line.startswith(("Case", "Appellate", "Appeal", "USCA")):
        continue
    elif line:
        # We found a line with good content. No OCR needed.
        return False
# We arrive here if no line was found containing good content.
return True
```
Here we look for any line that doesn't appear to be a Bates stamp, and as long as we find any text, garbled or otherwise, we say we're good to go. Unfortunately, this leads to some seriously garbled plain text in RECAP, and potentially in our opinion DB.
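To make the failure mode concrete, here is that CL check as a standalone function, exercised with a couple of illustrative inputs (the garbled sample string is invented, but representative of what mis-extracted text looks like):

```python
def needs_ocr(content: str) -> bool:
    """The current CL heuristic: skip Bates-stamp-like lines, then
    treat *any* remaining non-empty line as proof that OCR is unneeded."""
    for line in content.splitlines():
        line = line.strip()
        if line.startswith(("Case", "Appellate", "Appeal", "USCA")):
            continue
        elif line:
            # Any text at all, garbled or not, counts as "good content."
            return False
    return True

# A fully empty extraction is caught...
print(needs_ocr(""))  # True
# ...but a single line of mojibake slips through (hypothetical sample):
print(needs_ocr("Case 1:20-cv-01234 Document 1\n(cid:48)(cid:82)(cid:86)"))  # False
```

The second call is exactly the garbled-plain-text case: the heuristic has no notion of text *quality*, only presence.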
Examples
I don't want to rag on pdftotext; it has done an admirable job for the most part. But I do not think it is the best way to approach what we're dealing with now. For one, we are attempting to extract content and place it into a plain-text DB field. This is challenging because a good number of documents contain PDF objects, such as /widgets, /annotations, /freetext, /Stamp, and /Popup. That is not an exhaustive list; we also see links, signatures, and I'm sure more types.
In addition to the complexity of handling documents that contain PDF stream objects, we also have to deal with images inserted into PDFs, or even worse, the first (or maybe just the last) page being a rasterized page while the middle 30-odd pages are vector PDFs.
In that case, our checks fail and have no way to catch the problem, because after we iterate past the Bates stamp on page 2, we get good text. See: gov.uscourts.nysd.411264.100.0.pdf
This also fails when, for example, a free-text widget is added on top of an image-based PDF page, crossing out content or adding content to the page.
Here is an example of a non-image PDF page containing a FreeText widget (a widget, I think; it could be something different) meant to cross out the PROPOSED part.
This is not the perfect example, because the underlying content appears to contain text, but it is corrupted and looks like this:

In fact, in williams-v-t-mobile:
Side-by-side comparison of Williams v. T-Mobile
Note that PROPOSED is incorrectly added to the text here, frustrating the adjustment made by the court, which is noted in the document itself.
Angled, Circular, and Sideways Text
Not to be outdone, many judges (👋 CAND) like to use stamps with circular text. These stamps are often at the end of the document, but not exclusively. In using them, the courts introduce gibberish into our documents when we extract the text or OCR them.
For example, gov.uscourts.cand.16711.1203.0.pdf and another file have them adjacent to the text. One is stamped into an image PDF; the other is in a regular PDF, which it garbles.


In both cases, the generated content makes the OCR test fail to identify that OCR is needed.
Sideways Text
We also run into a problem where pdftotext does an amazing job of figuring out the sideways text and writing it into the output. This is just a fancy thing some courts and some firms like to do, but look at the result: it unnaturally expands the plain text and certainly frustrates plain-text searches. This happens in this case and in others (see below).
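One reason a char-level extractor would help here: pdfminer (which pdfplumber wraps) records whether each character is upright, so rotated stamp or sideways text can be dropped before the text is assembled. A minimal sketch, using invented stand-in data in place of a real `page.chars` list:

```python
def drop_rotated(chars):
    """Keep only upright characters; rotated stamp or sideways text
    (upright == False in pdfplumber/pdfminer) is discarded."""
    return [c for c in chars if c.get("upright")]

# Invented stand-in for pdfplumber's page.chars (real char dicts carry
# many more keys: x0, x1, top, bottom, size, fontname, ...):
chars = [
    {"text": "O", "upright": True},
    {"text": "K", "upright": True},
    {"text": "F", "upright": False},  # e.g. part of a circular stamp
]
print("".join(c["text"] for c in drop_rotated(chars)))  # OK
```

With a real page you would not join characters naively like this; you would filter first (pdfplumber's `page.filter()` accepts a test function like the one above) and then let its layout machinery rebuild the text.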
Margin Text
Occasionally, the use of small-font margin text causes some weird creations in the text, which again produce extra-wide text that is hard to view and display, and which I think makes it hard to query or search for the content you're looking for.


Final complaint (Bates Stamps)
Bates stamps on every page are ingested into the content and don't reflect the document as it was written. I would not expect to see Bates stamps or sidebar content in a published book, so I don't think we should display them in the plain text.
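If we do keep a line-level pass somewhere, the prefix check could at least be tightened into a pattern that targets the stamp format itself, rather than any line starting with "Case". A rough, illustrative sketch; the regex is my guess at the common CM/ECF header shape, not a vetted pattern:

```python
import re

# Matches headers shaped like:
#   Case 1:20-cv-01234 Document 5 Filed 01/01/21 Page 2 of 30
BATES_RE = re.compile(
    r"^\s*(USCA )?Case\s+\S+\s+Document\s+\S+\s+Filed\s+\S+\s+Page\s+\d+\s+of\s+\d+"
)

def strip_bates(content: str) -> str:
    """Drop lines that look like CM/ECF page stamps; keep everything else."""
    return "\n".join(
        line for line in content.splitlines() if not BATES_RE.match(line)
    )

text = (
    "Case 1:20-cv-01234 Document 5 Filed 01/01/21 Page 2 of 30\n"
    "Case law supports this motion."
)
print(strip_bates(text))  # Case law supports this motion.
```

Note that the current `startswith("Case")` check would also skip the second line, which is real sentence content; anchoring on the full stamp shape avoids that.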
What should we do?
If you've read this far, @mlissner, I know you must be dying to hear what I think the solution is. We should (I think) drop pdftotext for, you guessed it, pdfplumber.
pdfplumber can sample the PDFs to better determine whether an entire page is likely an image, while correctly recognizing that lines or signatures are in the document and leaving them be. Additionally, we can easily extract the pure text of the document while avoiding the pitfalls described above.
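On the "is this page really an image" question, here is roughly the shape I have in mind, with the geometry split into a pure helper. The `page.images` / `page.width` usage matches pdfplumber's API, but the 0.9 threshold is an arbitrary placeholder I made up:

```python
def image_coverage(page_area, image_bboxes):
    """Fraction of the page covered by image bounding boxes.
    Overlapping images double-count; fine for a coarse signal."""
    covered = sum((x1 - x0) * (y1 - y0) for (x0, y0, x1, y1) in image_bboxes)
    return covered / page_area

def page_is_probably_rasterized(page, threshold=0.9):
    """`page` is a pdfplumber Page; its image dicts expose x0/top/x1/bottom."""
    boxes = [(im["x0"], im["top"], im["x1"], im["bottom"]) for im in page.images]
    return image_coverage(page.width * page.height, boxes) >= threshold

# Geometry helper alone, with made-up numbers: one image covering
# 612x700 points of a 612x792 (US Letter) page.
print(image_coverage(612 * 792, [(0, 0, 612, 700)]))  # ~0.88
```

Because this runs per page, it would also catch the mixed case above: a rasterized first or last page scores near 1.0 and gets routed to OCR even when the middle 30 pages extract clean vector text.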
We should drop the check in CL and make all of these assessments here in doctor as well.
Solutions coming in the next post.