Text dropped in table after setting include_formatting=True #829
Open
Description
Content in a table is dropped if it contains a formatting tag and include_formatting is enabled.
Example:
<table>
<tr>
<td>
<p><b>GPT-5.4</b></p>
</td>
</tr>
</table>
will become
<table>
<row>
<cell> <p/> </cell>
</row>
</table>
I don't know why this happens, so I asked GLM-5, and it said this:
When include_formatting=True, trafilatura keeps formatting tags such as <b>, <strong>, and <i> and converts them to the internal <hi rend="#b"> format. The problem is in its strip_tags() step:
1. Normal flow (include_formatting=False):
- <td><p><b>GPT-5.4</b></p></td> → <cell><p>GPT-5.4</p></cell> ✅
2. Problem flow (include_formatting=True):
- <td><p><b>GPT-5.4</b></p></td>
- → Converted to <cell><p><hi rend="#b">GPT-5.4</hi></p></cell>
- → In a cleanup/merge step, the text inside <hi> is handled incorrectly
- → The result becomes <cell><p></p></cell> ❌
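To make the suspected failure mode concrete, here is a small standard-library sketch of how text can vanish when a formatting element is removed from a tree. This is not trafilatura's code: naive_remove and unwrap are hypothetical helpers written only for illustration, and whether strip_tags() is really the culprit is still just GLM-5's guess.

```python
import xml.etree.ElementTree as ET

CELL = '<cell><p><hi rend="#b">GPT-5.4</hi></p></cell>'


def naive_remove(parent, child):
    """Hypothetical buggy cleanup: drop the element, and its text with it."""
    parent.remove(child)


def unwrap(parent, child):
    """Hypothetical safe cleanup: drop the tag but keep its text and tail.

    Assumes the child is a leaf element (no nested children).
    """
    index = list(parent).index(child)
    kept = (child.text or "") + (child.tail or "")
    if index == 0:
        # No previous sibling: the text belongs to the parent itself.
        parent.text = (parent.text or "") + kept
    else:
        # Otherwise it is appended to the previous sibling's tail.
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + kept
    parent.remove(child)


lost = ET.fromstring(CELL)
p = lost.find("p")
naive_remove(p, p.find("hi"))

kept = ET.fromstring(CELL)
p = kept.find("p")
unwrap(p, p.find("hi"))

print(ET.tostring(lost, encoding="unicode"))
print(ET.tostring(kept, encoding="unicode"))
```

The first tree ends up with an empty <p>, which matches the shape of the output I am seeing. The second keeps "GPT-5.4" inside <p>, which is what I would expect once the formatting tag is gone.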
I want to use trafilatura to extract webpage content and turn it into an ebook to read, so I would like to keep both the formatting and the correct table structure.
For now I can set include_formatting=False to work around the problem, but that loses the formatting.
I would be glad to help if anyone can point me to how to fix it.
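In case it helps anyone look into this, here is a minimal script that should reproduce it. The wrapper page around the table (heading and filler paragraph) is made up so the extractor has some body text to work with, and the run helper is just a name I chose. The import guard only keeps the script loadable where trafilatura is not installed.

```python
try:
    import trafilatura
except ImportError:  # the script still loads, but nothing is extracted
    trafilatura = None

# The table from this report, unchanged.
TABLE = "<table><tr><td><p><b>GPT-5.4</b></p></td></tr></table>"

# Made-up surrounding page (an assumption, not part of the original example).
FILLER = "This paragraph only exists to give the extractor some body text. " * 5
HTML = (
    "<html><body><article><h1>Models</h1>"
    f"<p>{FILLER}</p>{TABLE}"
    "</article></body></html>"
)


def run(include_formatting):
    """Extract HTML as XML with the given include_formatting setting."""
    if trafilatura is None:
        return None
    return trafilatura.extract(
        HTML,
        output_format="xml",
        include_tables=True,
        include_formatting=include_formatting,
    )


if __name__ == "__main__":
    for flag in (False, True):
        print(f"--- include_formatting={flag} ---")
        print(run(flag))
```

Comparing the <cell> element in the two printed outputs should show whether the text survives with include_formatting=True.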