Index error on Hybrid Parser

**Describe the bug**

In some cases, the Hybrid parser raises an index error when it is used on a multipage PDF.
The problem is described and tested in #251.
What was merged there is a workaround rather than a fix.
It now fails gracefully.





**Steps to reproduce the bug**

See https://github.com/py-pdf/pypdf_table_extraction/blob/35d8d208cc274569737ab6df40c1bfe112c956ab/tests/test_network.py#L145-L157

**Expected behavior**

A potentially better fix would be to re-assemble the parts of the table detected by the network parser in the hybrid parser.
That part of the code also contains a TODO note from the original author:
https://github.com/py-pdf/pypdf_table_extraction/blob/35d8d208cc274569737ab6df40c1bfe112c956ab/camelot/parsers/network.py#L935-L957
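
To make the re-assembly idea more concrete, here is a minimal sketch of how table fragments could be merged back into one bounding box before further processing. It is not the library's code: the `BBox` alias, `merge_table_parts`, and the `max_gap` threshold are all hypothetical names chosen for illustration, and the sketch assumes PDF coordinates (y grows upward) and only compares each fragment with the one directly above it.

```python
from typing import List, Tuple

# Hypothetical bbox layout: (x0, y0, x1, y1) in PDF coordinates, y grows upward.
BBox = Tuple[float, float, float, float]


def union_bbox(a: BBox, b: BBox) -> BBox:
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def overlaps_horizontally(a: BBox, b: BBox) -> bool:
    """True if the x-ranges of a and b intersect."""
    return a[0] <= b[2] and b[0] <= a[2]


def vertical_gap(a: BBox, b: BBox) -> float:
    """Distance between a and b along y; 0.0 if they overlap vertically."""
    return max(0.0, max(a[1], b[1]) - min(a[3], b[3]))


def merge_table_parts(parts: List[BBox], max_gap: float = 10.0) -> List[BBox]:
    """Merge fragments that are stacked closely on top of each other.

    max_gap is an assumed tolerance, not a value taken from the library.
    """
    merged: List[BBox] = []
    # Walk the fragments from the top of the page to the bottom.
    for part in sorted(parts, key=lambda b: -b[3]):
        if (
            merged
            and overlaps_horizontally(merged[-1], part)
            and vertical_gap(merged[-1], part) <= max_gap
        ):
            merged[-1] = union_bbox(merged[-1], part)
        else:
            merged.append(part)
    return merged


parts = [(50, 400, 300, 500), (50, 290, 300, 395), (50, 50, 300, 100)]
print(merge_table_parts(parts))
```

In this example the first two fragments are 5 units apart and are merged, while the third one is far below and stays separate. Whether a simple gap threshold is good enough for real documents such as `tabula/schools.pdf` would need to be checked.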

**PDF**

tabula/schools.pdf

**Screenshots**

	def test_network_no_infinite_execution(testdir):
	    """Test for not infinite execution.

	    This test used to fail, because the network parse was'nt able to process the tables on this pages.
	    After a refactor it stops infinite execution. But parsing result could be improved.
	    Hence this is no qualitative test.
	    """
	    filename = os.path.join(testdir, "tabula/schools.pdf")
	    tables = camelot.read_pdf(
	        filename, flavor="network", backend="ghostscript", pages="4"
	    )

	    assert len(tables) >= 1

	def _generate_columns_and_rows(self, bbox, user_cols):
	    # select elements which lie within table_bbox
	    self.t_bbox = text_in_bbox_per_axis(
	        bbox, self.horizontal_text, self.vertical_text
	    )

	    all_tls = list(
	        sorted(
	            filter(
	                lambda textline: len(textline.get_text().strip()) > 0,
	                self.t_bbox["horizontal"] + self.t_bbox["vertical"],
	            ),
	            key=lambda textline: (-textline.y0, textline.x0),
	        )
	    )
	    text_x_min, text_y_min, text_x_max, text_y_max = bbox_from_textlines(all_tls)
	    # FRHTODO:
	    # This algorithm takes the horizontal textlines in the bbox, and groups
	    # them into rows based on their bottom y0.
	    # That's wrong: it misses the vertical items, and misses out on all
	    # the alignment identification work we've done earlier.
	    rows_grouped = self._group_rows(all_tls, row_tol=self.row_tol)
	    rows = self._join_rows(rows_grouped, text_y_max, text_y_min)
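
For readers who do not have the code open, the row grouping that the TODO note criticizes can be sketched roughly as follows. This is a simplified stand-in, not the library's `_group_rows`: `TextLine` is a hypothetical dataclass replacing the pdfminer text line objects, and `group_rows` with its default `row_tol` is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """Hypothetical stand-in for a text line object with a bottom-left corner."""
    text: str
    x0: float
    y0: float


def group_rows(textlines: List[TextLine], row_tol: float = 2.0) -> List[List[TextLine]]:
    """Group text lines into rows when their bottom y0 values are within row_tol."""
    # Same filter and ordering as in the snippet above: drop empty lines,
    # then go top to bottom and left to right.
    ordered = sorted(
        (tl for tl in textlines if tl.text.strip()),
        key=lambda tl: (-tl.y0, tl.x0),
    )
    rows: List[List[TextLine]] = []
    for tl in ordered:
        # Compare against the first line of the current row so the row
        # does not drift downward as more lines are added.
        if rows and abs(rows[-1][0].y0 - tl.y0) <= row_tol:
            rows[-1].append(tl)
        else:
            rows.append([tl])
    for row in rows:
        row.sort(key=lambda tl: tl.x0)
    return rows


lines = [
    TextLine("b", 120, 100.0),
    TextLine("a", 20, 100.8),
    TextLine("c", 20, 80.0),
    TextLine(" ", 60, 100.0),
    TextLine("d", 200, 99.5),
]
print([[tl.text for tl in row] for row in group_rows(lines)])
```

Because only the bottom `y0` is used, a vertical text line that spans several rows would be assigned to a single row at its lower edge, which matches the concern raised in the TODO note.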

Index error on Hybrid Parser #252

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Index error on Hybrid Parser #252

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions