Extract readable PDFs style information #197

Open
@grossir

We already use pdfplumber to extract the text of readable PDFs. Combining page.extract_words with page.extract_text can sometimes recover style information as well, since extract_words can return each word's font name. This may improve readability and help with issues like freelawproject/eyecite#198, where the style tags <i> and <em> are important.

page_text = page.extract_text(
    layout=True, keep_blank_chars=True, y_tolerance=5, y_density=25
)

For example, to get the italicized words on the second page:

import requests
import pdfplumber
from io import BytesIO

url = "https://storage.courtlistener.com/pdf/2025/01/30/georgia_insurers_insolvency_pool_v._logisticare_solutions_llc.pdf"
r = requests.get(url)
pdf = pdfplumber.open(BytesIO(r.content))
words = pdf.pages[1].extract_words(extra_attrs=["fontname"])

In [33]: [i['text'] for i in pdf.pages[1].extract_words(extra_attrs=["fontname"]) if i['fontname'] == 'TKMUCK+EquityARegular,Italic']
Out[33]: ['Wade', 'v.', 'Allstate', 'Fire', '&', 'Cas.', 'Co.']
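
Before filtering on one hard-coded font name, it may help to see which fonts a page actually uses. A minimal sketch, assuming the word dicts have the shape returned by extract_words(extra_attrs=["fontname"]); the helper name font_counts and the sample data (including the regular font name) are hypothetical:

```python
from collections import Counter


def font_counts(words):
    """Count how many words use each font name."""
    return Counter(word["fontname"] for word in words)


# Hypothetical sample shaped like extract_words(extra_attrs=["fontname"]) output.
# The regular font name below is made up for illustration.
sample_words = [
    {"text": "See", "fontname": "ABCDEF+EquityARegular"},
    {"text": "Wade", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "v.", "fontname": "TKMUCK+EquityARegular,Italic"},
]

# Print fonts from most to least common
for fontname, count in font_counts(sample_words).most_common():
    print(fontname, count)
```

With a real PDF, sample_words would be replaced by pdf.pages[1].extract_words(extra_attrs=["fontname"]).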

Then, the italicized words can be mapped back onto the extracted main text. We would probably need to prebuild a list of courts where this is possible, and filter which styles we want to preserve.
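
One possible way to do that mapping, sketched under simplifying assumptions: it rebuilds a single line from the word list instead of aligning against the layout-preserving extract_text output, and it treats any font whose name contains "Italic" or "Oblique" as italic. The helper names is_italic and tag_italics and the sample data are hypothetical, not part of pdfplumber:

```python
def is_italic(fontname):
    # Heuristic assumption: italic fonts carry "italic" or "oblique" in their name
    name = fontname.lower()
    return "italic" in name or "oblique" in name


def tag_italics(words, tag="em"):
    """Join word dicts into one string, wrapping runs of italic words in <tag>."""
    pieces = []
    run = []  # consecutive italic words collected so far
    for word in words:
        if is_italic(word.get("fontname", "")):
            run.append(word["text"])
            continue
        if run:
            # A non-italic word closes the current italic run
            pieces.append(f"<{tag}>{' '.join(run)}</{tag}>")
            run = []
        pieces.append(word["text"])
    if run:
        # Flush a run that reaches the end of the word list
        pieces.append(f"<{tag}>{' '.join(run)}</{tag}>")
    return " ".join(pieces)


# Hypothetical sample shaped like extract_words(extra_attrs=["fontname"]) output.
# The regular font name is made up for illustration.
line_words = [
    {"text": "See", "fontname": "ABCDEF+EquityARegular"},
    {"text": "Wade", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "v.", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "Allstate", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "(2020)", "fontname": "ABCDEF+EquityARegular"},
]

print(tag_italics(line_words))
```

A real implementation would also need to respect line breaks and the spacing produced by layout=True, and the font-name heuristic would likely have to be checked per court.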
