schema/pdfbox2 fails to extract text as well as pdftotext #20

@jmscott

for a particular pdf

sha:3acd68c1cb7effbc9c2cf50fda6decd96d555d64

the title on the first line of the first page is not extracted correctly

sha:c64e0721c2d5ccdf48992d9a78dbe7d179bbf471

in particular, the venerable pdftotext appears to recognize the newline that
separates the title from the author name. here is the blob extracted by pdftotext:

sha:c64e0721c2d5ccdf48992d9a78dbe7d179bbf471

why?
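
one way to narrow this down is to diff the first few non-empty lines of the two extractions. the following is only a minimal sketch: the helper names `first_lines` and `compare_extractions` are hypothetical, and the two sample strings are made-up stand-ins, not the contents of the blobs referenced above.

```python
import difflib


def first_lines(text, n=3):
    """Return the first n non-empty lines of an extracted text blob, stripped."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return lines[:n]


def compare_extractions(pdfbox_text, pdftotext_text, n=3):
    """Return a unified diff of the first n non-empty lines of both extractions."""
    return list(
        difflib.unified_diff(
            first_lines(pdfbox_text, n),
            first_lines(pdftotext_text, n),
            fromfile="pdfbox2",
            tofile="pdftotext",
            lineterm="",
        )
    )


# Hypothetical sample text (assumption): title and author fused on one line
# in one extraction, separated by a newline in the other.
pdfbox_sample = "An Example Title Jane Doe\nAbstract\n"
pdftotext_sample = "An Example Title\nJane Doe\nAbstract\n"

for line in compare_extractions(pdfbox_sample, pdftotext_sample):
    print(line)
```

an empty diff would mean the two tools agree on the leading lines; a fused title/author line shows up as one removed line and two added lines.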
