We already use pdfplumber to extract text from readable PDFs. If we use page.extract_words in conjunction with page.extract_text, we can sometimes recover style information as well. This may improve readability and help with issues like freelawproject/eyecite#198, where style tags such as <i> and <em> are important.
doctor/doctor/lib/text_extraction.py
Lines 67 to 69 in 9e0e76f
For example, to get the italicized words on the second page:
import requests
import pdfplumber
from io import BytesIO
url = "https://storage.courtlistener.com/pdf/2025/01/30/georgia_insurers_insolvency_pool_v._logisticare_solutions_llc.pdf"
r = requests.get(url)
pdf = pdfplumber.open(BytesIO(r.content))
words = pdf.pages[1].extract_words(extra_attrs=["fontname"])
In [33]: [i['text'] for i in pdf.pages[1].extract_words(extra_attrs=["fontname"]) if i['fontname'] == 'TKMUCK+EquityARegular,Italic']
Out[33]: ['Wade', 'v.', 'Allstate', 'Fire', '&', 'Cas.', 'Co.']
The italicized words can then be matched back to their positions in the main extracted text. We would probably need to prebuild a list of courts where this works, and decide which styles we want to preserve.
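As a rough sketch of that matching step, something like the following could wrap runs of italic words in <em> tags. It assumes the word dicts come from page.extract_words(extra_attrs=["fontname"]), and that italic faces can be recognized by "Italic" or "Oblique" in the font name, which will not hold for every court's PDFs. The helper names is_italic and tag_italics are hypothetical, the sample words are made up for illustration, and the non-italic font name is a guess. It also ignores line breaks and layout, which a real implementation would need to reconcile with page.extract_text.

```python
from itertools import groupby


def is_italic(fontname: str) -> bool:
    # Heuristic (assumption): italic faces usually carry "Italic" or
    # "Oblique" somewhere in the font name, e.g. "...Regular,Italic".
    name = fontname.lower()
    return "italic" in name or "oblique" in name


def tag_italics(words: list[dict]) -> str:
    # Group consecutive words by whether their font looks italic,
    # then wrap each italic run in a single <em>...</em> pair.
    parts = []
    for italic, run in groupby(words, key=lambda w: is_italic(w.get("fontname", ""))):
        text = " ".join(w["text"] for w in run)
        parts.append(f"<em>{text}</em>" if italic else text)
    return " ".join(parts)


# Made-up sample shaped like pdfplumber's extract_words output
# (only the keys used here; the regular font name is hypothetical).
sample = [
    {"text": "See", "fontname": "TKMUCK+EquityARegular"},
    {"text": "Wade", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "v.", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "Allstate", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "supra.", "fontname": "TKMUCK+EquityARegular"},
]

tagged = tag_italics(sample)
```

Grouping consecutive words keeps a case name under one tag instead of tagging every word separately, which should be easier for downstream tools like eyecite to consume.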