This repository was archived by the owner on Apr 11, 2025. It is now read-only.
forked from camelot-dev/camelot
Index error on Hybrid Parser #252
Open
Labels
bug (Something isn't working), help wanted (Extra attention is needed)
Description
Describe the bug
In some cases, an index error occurs when using the Hybrid parser on a multipage PDF.
It is described and tested in #251.
What was merged there is a workaround rather than a fix.
With it, the parser now fails gracefully.
Steps to reproduce the bug
See
pypdf_table_extraction/tests/test_network.py
Lines 145 to 157 in 35d8d20
def test_network_no_infinite_execution(testdir):
    """Test for not infinite execution.

    This test used to fail, because the network parse was'nt able to process the tables on this pages.
    After a refactor it stops infinite execution. But parsing result could be improved.
    Hence this is no qualitative test.
    """
    filename = os.path.join(testdir, "tabula/schools.pdf")
    tables = camelot.read_pdf(
        filename, flavor="network", backend="ghostscript", pages="4"
    )
    assert len(tables) >= 1
Expected behavior
A potentially better fix would be to re-assemble the parts of the table detected by the network parser inside the hybrid parser.
That part of the code also contained a TODO note from the original author.
pypdf_table_extraction/camelot/parsers/network.py
Lines 935 to 957 in 35d8d20
def _generate_columns_and_rows(self, bbox, user_cols):
    # select elements which lie within table_bbox
    self.t_bbox = text_in_bbox_per_axis(
        bbox, self.horizontal_text, self.vertical_text
    )
    all_tls = list(
        sorted(
            filter(
                lambda textline: len(textline.get_text().strip()) > 0,
                self.t_bbox["horizontal"] + self.t_bbox["vertical"],
            ),
            key=lambda textline: (-textline.y0, textline.x0),
        )
    )
    text_x_min, text_y_min, text_x_max, text_y_max = bbox_from_textlines(all_tls)
    # FRHTODO:
    # This algorithm takes the horizontal textlines in the bbox, and groups
    # them into rows based on their bottom y0.
    # That's wrong: it misses the vertical items, and misses out on all
    # the alignment identification work we've done earlier.
    rows_grouped = self._group_rows(all_tls, row_tol=self.row_tol)
    rows = self._join_rows(rows_grouped, text_y_max, text_y_min)
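To make the TODO note easier to follow, here is a minimal, self-contained sketch of grouping text lines into rows by their bottom `y0` within a tolerance. It is not the library's `_group_rows` implementation: `TextLine` and `group_rows_by_y0` are hypothetical names made up for this illustration, and the default tolerance value is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """Hypothetical stand-in for a text line; only the fields used below."""

    text: str
    x0: float  # left edge
    y0: float  # bottom edge (PDF coordinates, y grows upwards)


def group_rows_by_y0(textlines, row_tol=2.0):
    """Group text lines whose bottom edge differs by at most row_tol."""
    rows = []
    # Walk from the top of the page downwards, as the sort key in the snippet does.
    for tl in sorted(textlines, key=lambda t: (-t.y0, t.x0)):
        if not tl.text.strip():
            continue  # ignore empty text lines
        # Compare against the first line of the current row.
        if rows and abs(rows[-1][0].y0 - tl.y0) <= row_tol:
            rows[-1].append(tl)
        else:
            rows.append([tl])
    # Order each row from left to right.
    return [sorted(row, key=lambda t: t.x0) for row in rows]


lines = [
    TextLine("Name", 10, 100.0),
    TextLine("City", 80, 100.5),
    TextLine("Alice", 10, 88.0),
    TextLine("Paris", 80, 87.6),
    TextLine("  ", 10, 70.0),
]
rows = group_rows_by_y0(lines)
texts = [[tl.text for tl in row] for row in rows]
```

A grouping of this kind only looks at `y0`. A vertical text line spans several rows, so its single `y0` places it in just one of them, which is the kind of limitation the TODO note points at.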
tabula/schools.pdf