Text dropped in table after setting include_formatting=True #829

@SnowFox4004

Content inside a table cell is dropped when it contains a formatting tag and include_formatting is enabled.
Example:

<table>
    <tr>
        <td>
            <p><b>GPT-5.4</b></p>
        </td>
    </tr>
</table>

becomes

<table>
  <row>
    <cell> <p/> </cell>
  </row>
</table>
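
A minimal reproduction script might look like the sketch below. It assumes trafilatura is installed and uses the documented `extract()` parameters `include_formatting`, `include_tables`, and `output_format`. The surrounding `<html>/<body>/<article>` wrapper and the helper names `extract_xml` and `report` are made up for illustration and are not part of the original report. On a page this small, `extract()` may also return None, so the script prints whatever comes back without assuming a particular result.

```python
# Hypothetical reproduction sketch; the wrapper markup and helper names are assumptions.
HTML = """<html><body><article>
<table>
    <tr>
        <td>
            <p><b>GPT-5.4</b></p>
        </td>
    </tr>
</table>
</article></body></html>"""


def extract_xml(html, include_formatting):
    """Run trafilatura on the snippet and return its XML output (or None)."""
    # Imported lazily so this file can be loaded even where trafilatura is missing.
    from trafilatura import extract

    return extract(
        html,
        output_format="xml",
        include_tables=True,
        include_formatting=include_formatting,
    )


def report():
    """Print the output for both settings so they can be compared side by side."""
    for flag in (False, True):
        result = extract_xml(HTML, flag)
        print(f"--- include_formatting={flag} ---")
        print(result)
        # Flag the case described in this issue: the cell text is missing.
        if result is not None and "GPT-5.4" not in result:
            print("cell text is missing from the output")
```

Calling `report()` prints both results; comparing the two makes it easy to see whether the cell text survives in each mode.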

I have no idea why this happens, so I asked GLM-5, and it said this:


When include_formatting=True, trafilatura keeps formatting tags such as <b>, <strong>, and <i> and converts them to the internal <hi rend="#b"> format.
The problem is in its strip_tags() process:

1. Normal Flow (include_formatting=False):
      - <td><p><b>GPT-5.4</b></p></td> → <cell><p>GPT-5.4</p></cell> ✅

2. Problematic flow (include_formatting=True):
      - <td><p><b>GPT-5.4</b></p></td>
      - → Converted to <cell><p><hi rend="#b">GPT-5.4</hi></p></cell>
      - → In a cleanup/merge step, the text inside <hi> is handled incorrectly
      - → The result becomes <cell><p></p></cell> ❌
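
The kind of failure GLM-5 describes can be illustrated in isolation. The sketch below is not trafilatura's code and does not confirm the diagnosis; it only shows, with the standard library's `xml.etree.ElementTree`, how removing an `<hi>` element outright loses its text, while unwrapping it keeps the text. The helper names `drop_hi` and `unwrap_hi` are hypothetical, and the unwrap helper assumes `<hi>` holds only text, with no nested elements.

```python
import xml.etree.ElementTree as ET

CELL = '<cell><p><hi rend="#b">GPT-5.4</hi></p></cell>'


def drop_hi(xml):
    """Remove every <hi> element outright; its text goes with it."""
    root = ET.fromstring(xml)
    # Snapshot the traversal so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "hi":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


def unwrap_hi(xml):
    """Remove <hi> tags but re-attach their text (and tail) to the surrounding node."""
    root = ET.fromstring(xml)
    for parent in list(root.iter()):
        prev = None  # last sibling that stays in the tree
        for child in list(parent):
            if child.tag != "hi":
                prev = child
                continue
            kept = (child.text or "") + (child.tail or "")
            if prev is None:
                # No earlier sibling: the text belongs to the parent itself.
                parent.text = (parent.text or "") + kept
            else:
                # Otherwise it follows the previous sibling as tail text.
                prev.tail = (prev.tail or "") + kept
            parent.remove(child)
    return ET.tostring(root, encoding="unicode")


print(drop_hi(CELL))    # the paragraph is left empty
print(unwrap_hi(CELL))  # the paragraph still contains GPT-5.4
```

If the real cause is similar, the fix would be to unwrap rather than drop at whichever step handles `<hi>` inside table cells, but that remains to be verified against the library's source.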

Since I want to use it to extract web page content and turn it into an ebook to read, I would like to keep both the formatting and the correct table structure.
For now I can set include_formatting=False to work around the problem, but that is not ideal.
I would be glad to help if anyone can tell me how to fix it.
