Extract style information from readable PDFs

We already use `pdfplumber` to extract a readable PDF's text. If we use `page.extract_words` (which can return each word's font name) in conjunction with `page.extract_text`, we can sometimes recover style information as well. This may improve readability and help with issues like https://github.com/freelawproject/eyecite/issues/198, where style tags such as `<i>` and `<em>` are important.

https://github.com/freelawproject/doctor/blob/9e0e76f9b767d860590274cbf090ba614bbf849e/doctor/lib/text_extraction.py#L67-L69


For [example](https://www.courtlistener.com/opinion/10323895/georgia-insurers-insolvency-pool-v-logisticare-solutions-llc/?type=o&type=o&q=&order_by=dateFiled+desc), to get the italics on the second page:

![Image](https://github.com/user-attachments/assets/78cfa76f-433b-4974-9918-07b0967e60e1)

```python
from io import BytesIO

import pdfplumber
import requests

url = "https://storage.courtlistener.com/pdf/2025/01/30/georgia_insurers_insolvency_pool_v._logisticare_solutions_llc.pdf"
r = requests.get(url)
pdf = pdfplumber.open(BytesIO(r.content))

# pages[1] is the second page; extra_attrs keeps each word's font name
words = pdf.pages[1].extract_words(extra_attrs=["fontname"])

# Keep only the words set in the italic font
italic_words = [
    w["text"] for w in words if w["fontname"] == "TKMUCK+EquityARegular,Italic"
]
print(italic_words)
# ['Wade', 'v.', 'Allstate', 'Fire', '&', 'Cas.', 'Co.']
```
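Note that the font name above starts with a subset tag (`TKMUCK+`): PDFs prefix embedded font subsets with six uppercase letters and a `+`, and that tag differs between files, so matching on the full string probably won't generalize. A minimal sketch of a looser check, assuming italic faces carry "Italic" or "Oblique" in their name (this will not hold for every court's PDFs; `is_italic` is a hypothetical helper, not something in doctor or pdfplumber):

```python
import re

# Six uppercase letters followed by "+" mark an embedded font subset
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def is_italic(fontname: str) -> bool:
    """Guess from a pdfplumber `fontname` whether the face is italic."""
    base = SUBSET_PREFIX.sub("", fontname).lower()
    # Assumption: the style is spelled out in the font name
    return "italic" in base or "oblique" in base
```

With this, the list comprehension above could filter on `is_italic(w["fontname"])` instead of a hard-coded name.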

The italicized words can then be mapped back onto the main extracted text. We would probably need to prebuild a list of courts where this is possible, and filter which styles we want to preserve.
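As a rough illustration of the mapping step, the sketch below wraps consecutive italic words in `<em>` tags. It works on word dicts shaped like the output of `extract_words(extra_attrs=["fontname"])`, but the sample data is made up, `tag_italics` is a hypothetical helper, and the default italic check is an assumption about font naming. It also joins words with single spaces, so it does not reproduce the layout that `extract_text(layout=True)` gives; aligning with that text would need the word positions too.

```python
from html import escape
from itertools import groupby


def tag_italics(words, is_italic=lambda name: "italic" in name.lower()):
    """Join word dicts into text, wrapping runs of italic words in <em>."""
    parts = []
    # groupby collapses consecutive words that share the same italic flag
    for italic, run in groupby(words, key=lambda w: is_italic(w["fontname"])):
        text = " ".join(escape(w["text"]) for w in run)
        parts.append(f"<em>{text}</em>" if italic else text)
    return " ".join(parts)


# Made-up sample in the shape pdfplumber returns (other keys omitted)
sample = [
    {"text": "See", "fontname": "ABCDEF+EquityARegular"},
    {"text": "Wade", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "v.", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "Allstate", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "today", "fontname": "ABCDEF+EquityARegular"},
]
print(tag_italics(sample))
# See <em>Wade v. Allstate</em> today
```

Escaping each word keeps characters such as `&` from producing invalid markup once tags are mixed into the text.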

For reference, the current `extract_text` call in doctor (the lines linked above):

    page_text = page.extract_text(
        layout=True, keep_blank_chars=True, y_tolerance=5, y_density=25
    )
