This repository was archived by the owner on Apr 11, 2025. It is now read-only.

Index error on Hybrid Parser #252

@bosd

Describe the bug

In some cases, an index error occurs when using the Hybrid parser on a multipage PDF.
It is described and tested in #251.
What was merged there is a workaround rather than a fix: the parser now fails gracefully.

Steps to reproduce the bug

See the following test:

import os

import camelot


def test_network_no_infinite_execution(testdir):
    """Test for not infinite execution.

    This test used to fail because the network parser wasn't able to process
    the tables on this page.
    After a refactor it stops infinite execution, but the parsing result
    could be improved. Hence this is not a qualitative test.
    """
    # `testdir` is a pytest fixture pointing at the directory with the test PDFs.
    filename = os.path.join(testdir, "tabula/schools.pdf")
    tables = camelot.read_pdf(
        filename, flavor="network", backend="ghostscript", pages="4"
    )
    assert len(tables) >= 1

Expected behavior

A potentially better fix would be to reassemble the parts of the table detected by the network parser in the hybrid parser.
That part of the code also contains a TODO note from the original author:

def _generate_columns_and_rows(self, bbox, user_cols):
    # select elements which lie within table_bbox
    self.t_bbox = text_in_bbox_per_axis(
        bbox, self.horizontal_text, self.vertical_text
    )
    all_tls = list(
        sorted(
            filter(
                lambda textline: len(textline.get_text().strip()) > 0,
                self.t_bbox["horizontal"] + self.t_bbox["vertical"],
            ),
            key=lambda textline: (-textline.y0, textline.x0),
        )
    )
    text_x_min, text_y_min, text_x_max, text_y_max = bbox_from_textlines(all_tls)
    # FRHTODO:
    # This algorithm takes the horizontal textlines in the bbox, and groups
    # them into rows based on their bottom y0.
    # That's wrong: it misses the vertical items, and misses out on all
    # the alignment identification work we've done earlier.
    rows_grouped = self._group_rows(all_tls, row_tol=self.row_tol)
    rows = self._join_rows(rows_grouped, text_y_max, text_y_min)
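
To make the TODO comment easier to follow, here is a minimal sketch of grouping text lines into rows by their bottom `y0` within a tolerance. It is not the library's `_group_rows` implementation: `Textline`, `group_rows_by_y0`, and the default `row_tol=2` are hypothetical names and values chosen for illustration only.

```python
from collections import namedtuple

# Hypothetical stand-in for a text line object; only the fields used here.
Textline = namedtuple("Textline", ["text", "x0", "y0"])


def group_rows_by_y0(textlines, row_tol=2):
    """Group text lines into rows whose bottom edges (y0) lie within row_tol."""
    # Same ordering as the snippet above: top to bottom, then left to right.
    ordered = sorted(textlines, key=lambda t: (-t.y0, t.x0))
    rows = []
    for t in ordered:
        if not t.text.strip():
            continue  # skip empty text lines
        # Compare against the first line of the current row (assumption:
        # the first line is the reference for the whole row).
        if rows and abs(rows[-1][0].y0 - t.y0) <= row_tol:
            rows[-1].append(t)
        else:
            rows.append([t])
    # Order each row left to right for readability.
    for row in rows:
        row.sort(key=lambda t: t.x0)
    return rows


lines = [
    Textline("Name", 10, 100.0),
    Textline("Score", 80, 100.5),
    Textline("Alice", 10, 85.0),
    Textline("42", 80, 84.2),
    Textline("rotated", 150, 60.0),  # e.g. a vertical item with its own y0
]
rows = group_rows_by_y0(lines)
row_texts = [[t.text for t in row] for row in rows]
```

In this sketch an item whose bottom `y0` does not match any other line ends up in a row of its own, which is one way a `y0`-only grouping can misplace vertical text.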

PDF

tabula/schools.pdf
