
Update pacer.py to fix price/cost of transcripts, which are not capped #5990


Open: 3 commits to merge into base branch main

mikeweinberg
Contributor

My attempt at fixing Issue #5429 where cost/price of transcripts aren't being accurately calculated because they are not capped at $3.

I just check the RECAPDocument.description field is transcript (case insensitive). Would we also have to check if its docketentry.description contains something like "NOTICE OF FILING OF OFFICIAL TRANSCRIPT"?

I also added some constants, which perhaps could be imported from elsewhere.

@johnhawkinson
Contributor

johnhawkinson commented Jul 16, 2025

> I just check the RECAPDocument.description field is transcript (case insensitive). Would we also have to check if its docketentry.description contains something like "NOTICE OF FILING OF OFFICIAL TRANSCRIPT"?

What sort of validation have you done here?

The RECAPDocument.description only exists if we get it from RSS (iirc), and that means it will be missing for many older documents and entirely for some courts.

The docketentry.description only exists if we have ingested the docket report. I haven't looked at the 208 instances to see how they vary, but here in D.Mass, the event titled "Notice of Filing of Official Transcript" is merely a notice, not the transcript itself, and has no attached document. We don't capture event titles, but the docket text is something like "NOTICE is hereby given that an official transcript of a proceeding has been filed by the court reporter in the above-captioned matter. Counsel are referred to the Court's Transcript Redaction Policy, available on the court website at https://www.mad.uscourts.gov/caseinfo/transcripts.htm (DRK)".

The actual transcript event has docket text like "Transcript of Evidentiary Hearing held in Hampden Courtroom as to Nia Dinzey held on April 24, 2025, before Judge Mark G. Mastroianni. Court Reporter Name and Contact Information: Leigh Gershowitz at [email protected] The Transcript may be purchased through the Court Reporter, viewed at the public terminal, or viewed through PACER after it is released. Redaction Request due 8/6/2025. Redacted Transcript Deadline set for 8/18/2025. Release of Transcript Restriction set for 10/14/2025. (DRK)"

But my neighbor to the north in D. New Hampshire skips the two-event process and just dockets a Transcript event with text like "TRANSCRIPT of Proceedings for Motion Hearing held on May 21, 2025. Court Reporter: Susan Bateman, Telephone # 603-225-1453. Transcript is available for public inspection, but may not be copied or otherwise reproduced, at the Clerk's Office for a period of 90 days. Additionally, only attorneys of record and pro se parties with an ECF login and password who purchase a transcript from the court reporter will have access to the transcript through PACER during this 90-day period. If you would like to order a copy, please contact the court reporter at the above listed phone number.

"NOTICE: Any party who requests an original transcript has 21 days from service of this notice to determine whether it is necessary to redact any personal identifiers and, if so, to electronically file a Redaction Request.

"Redaction Request Follow Up 8/1/2025. Redacted Transcript Follow Up 8/11/2025.
Release of Transcript Restriction set for 10/9/2025.(jwb)"

And, say, D.D.C. is like New Hampshire: "TRANSCRIPT OF REMEDIES HEARING PROCEEDINGS - DAY 1 MORNING SESSION before Judge Amit P. Mehta held on April 21, 2025; Page Numbers: 1-130. Court Reporter/ Transcriber: William Zaremba; Email: [email protected]. Transcripts may be ordered by submitting the Transcript Order Form

"For the first 90 days after this filing date, the transcript may be viewed at the courthouse at a public terminal or purchased from the court reporter referenced above. After 90 days, the transcript may be accessed via PACER. Other transcript formats, (multi-page, condensed, PDF or ASCII) may be purchased from the court reporter.

"NOTICE RE REDACTION OF TRANSCRIPTS: The parties have twenty-one days to file with the court and the court reporter any request to redact personal identifiers from this transcript. If no such requests are filed, the transcript will be made available to the public via PACER without redaction after 90 days. The policy, which includes the five personal identifiers specifically covered, is located on our website at www.dcd.uscourts.gov.

Redaction Request due 7/30/2025. Redacted Transcript Deadline set for 8/9/2025.
Release of Transcript Restriction set for 10/7/2025.(Zaremba, William)"

But searching for those strings is probably not sufficient. I don't remember if I've ever seen one, but if somebody files "OBJECTION to [23] TRANSCRIPT OF REMEDIES HEARING PROCEEDINGS - DAY 1 MORNING SESSION," then that would presumably not be something that should be captured.
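To make the pitfall concrete, here is a minimal sketch comparing a plain substring search with a match anchored at the start of the docket text, using abbreviated versions of the examples above. The function names and the regex are hypothetical, not taken from pacer.py, and the anchored match is still only a guess that other courts' conventions could defeat.

```python
import re

# Hypothetical pattern: docket text that *begins* with the word "transcript".
STARTS_WITH_TRANSCRIPT = re.compile(r"^\s*transcript\b", re.IGNORECASE)


def naive_match(docket_text: str) -> bool:
    """Substring search: true if 'transcript' appears anywhere in the text."""
    return "transcript" in docket_text.lower()


def anchored_match(docket_text: str) -> bool:
    """True only if the docket text starts with the word 'transcript'."""
    return bool(STARTS_WITH_TRANSCRIPT.match(docket_text))


samples = {
    "notice": "NOTICE is hereby given that an official transcript of a proceeding has been filed",
    "transcript": "Transcript of Evidentiary Hearing held in Hampden Courtroom",
    "objection": "OBJECTION to [23] TRANSCRIPT OF REMEDIES HEARING PROCEEDINGS",
}

for label, text in samples.items():
    print(label, naive_match(text), anchored_match(text))
```

The substring search flags all three entries, including the notice and the hypothetical objection. The anchored match flags only the actual transcript entry in these samples, but nothing guarantees that every court starts its transcript entries that way.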

Ninety-four districts, ninety-four bankruptcy courts, 13 Courts of Appeals (with BAPs), some jay pee em ell and international trade, 204 ways of doing things. Oh, and don't forget those torts.

@mikeweinberg
Contributor Author

@johnhawkinson I see what you mean.

I was initially looking at the documents on the prayers leaderboard and noticed a bunch were called "transcript", but yeah, I see some have no RECAPDocument.description, and "transcript" isn't universal either.

This docket has transcripts with RECAPDocument.description set to "Transcript (CR)". For example, doc number 76 is listed as $3, but because it's a transcript, people will be billed for all 38 pages, or $3.80.
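The arithmetic behind that example, as a minimal sketch. It assumes PACER's $0.10 per-page fee and the $3.00 per-document cap that transcripts are exempt from; the constant and function names are hypothetical and not the ones used in pacer.py.

```python
from decimal import Decimal

PER_PAGE_FEE = Decimal("0.10")  # assumed PACER per-page fee
DOCUMENT_CAP = Decimal("3.00")  # per-document cap; transcripts are exempt


def estimate_cost(page_count: int, is_transcript: bool) -> Decimal:
    """Estimate the fee for one document, skipping the cap for transcripts."""
    uncapped = PER_PAGE_FEE * page_count
    if is_transcript:
        return uncapped
    return min(uncapped, DOCUMENT_CAP)


print(estimate_cost(38, is_transcript=False))  # 3.00
print(estimate_cost(38, is_transcript=True))   # 3.80
```

Using `Decimal` rather than floats keeps the cents exact, which matters when the estimate is compared against what a user was actually billed.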

And I guess parsing the "Transaction Receipt" page from users is perhaps more trouble than it's worth?

[Screenshot: 2025-07-16 19_03_16-Clipboard]

@johnhawkinson
Contributor

(I assume it is obvious that "Transcript (CR)" means "Transcript (Criminal).")

> And I guess parsing the "Transaction Receipt" page from users is perhaps more trouble than it's worth?

I don't see why it would be more trouble than it is worth? We already do some parsing of that, and it seems like it is guaranteed to be more reliable than any method that attempts to guess, particularly given what I outlined above. Given the choice between trying to predict what someone else will do, and just trying the thing, we should always try the thing unless it is somehow meaningfully more expensive.

(I think a problem is that we have a substantial corpus of wrongly parsed pricing for transcripts, but why should we let that stop us going forward?)

@v-anne
Contributor

v-anne commented Jul 17, 2025

I support fixing this issue, but I'm not sure this is the best solution. I think one of the better things we could do is collect the prices from the transaction receipt pages, at least for new documents going forward.
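One way to picture "receipt prices going forward" is a lookup that prefers a cost observed on a transaction receipt and falls back to an estimate for older documents. This is only a sketch: `PriceInfo`, `receipt_cost`, and `estimated_cost` are hypothetical names, not fields in the existing models, and how the receipt itself would be parsed is left out entirely.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class PriceInfo:
    """Hypothetical container for what is known about a document's price."""

    estimated_cost: Decimal  # guessed from page count and any heuristic
    receipt_cost: Optional[Decimal] = None  # seen on a transaction receipt


def best_known_cost(info: PriceInfo) -> Decimal:
    """Prefer the price actually observed on a receipt over the estimate."""
    if info.receipt_cost is not None:
        return info.receipt_cost
    return info.estimated_cost


old_doc = PriceInfo(estimated_cost=Decimal("3.00"))
new_doc = PriceInfo(estimated_cost=Decimal("3.00"), receipt_cost=Decimal("3.80"))
print(best_known_cost(old_doc), best_known_cost(new_doc))
```

Checking `is not None` rather than truthiness keeps a legitimately observed cost of zero from being overridden by the estimate.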

@mlissner
Member

So there's the long description (from the docket entry) and the short description (from the attachment page, among other places).

Could we just use the short description, which is often (always?) the word "Transcript"?:

[Image: screenshot]

I suspect it would work pretty well?

@johnhawkinson
Contributor

> So there's the long description (from the docket entry) and the short description (from the attachment page, among other places).
> Could we just use the short description, which is often (always?) the word "Transcript"?:

Because transcripts are single-document items (to the best of my knowledge, that is always true), there is no attachment page, so this only comes from the RSS feed. So it is absent for those districts that do not include transcripts in their RSS feeds (esp. those with no RSS feeds at all).

Why would we choose a heuristic we know is unreliable when there is one with perfect fidelity?

(As noted previously, there are several variants, including "Transcript (CR)". That's still a simple substring search in the short description, but it is not quite as simple as the match you describe.)
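A minimal sketch of that substring check, with the missing-description case made explicit. The function name is hypothetical; returning `None` is one possible way to distinguish "no short description available" from "not a transcript", since the field is absent for courts that do not publish transcripts in RSS.

```python
from typing import Optional


def looks_like_transcript(short_description: Optional[str]) -> Optional[bool]:
    """Substring check on the short description.

    Returns None when there is no description to inspect, so callers can
    tell "unknown" apart from "checked and not a transcript".
    """
    if not short_description:
        return None
    return "transcript" in short_description.lower()


for value in ["Transcript", "Transcript (CR)", "Order", "", None]:
    print(repr(value), looks_like_transcript(value))
```

The substring test handles variants such as "Transcript (CR)" that an exact, case-insensitive equality check would miss.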

@mlissner
Member

> Why would we choose a heuristic we know is unreliable when there is one with perfect fidelity?

I like it because it takes one fewer scrape of PACER and we're already doing a lot of scraping these days.


4 participants